
ROS2 Foxy & RMW Fast DDS: Improved Intra-process & Inter-process performance

This week we announced our new Fast DDS 2.1 release for Foxy, and I promised some data about one of the new features: the improved intra-process & inter-process performance.

Fast DDS was already really fast, but over the last months we have been making some modifications in preparation for the upcoming Zero Copy shared memory transport planned for the end of this year. As a result, Fast DDS 2.1 is now strikingly fast, and the larger the data, the bigger the improvement.

Here you can see several graphs showing latency and throughput for rmw_fastrtps, in both single-process and two-process configurations, comparing the Sync and Async alternatives against the prior version of Fast DDS.

Let’s start with intra-process:

In the next graph, we try to send 1000 messages of different sizes in one second, to measure the end-to-end processing capacity and see how many messages per second rmw_fastrtps is able to handle.
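To make the setup concrete, here is a minimal sketch of such a latency probe using rclcpp. This is only a simplified illustration with made-up node and topic names, stamping a tiny message and publishing at 1000 Hz; the actual benchmark also varies the payload size.

```cpp
#include <chrono>
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/header.hpp"

using namespace std::chrono_literals;

// Publishes a time-stamped message at 1000 Hz and, on the subscription side,
// reports (receive time - send time) as the end-to-end latency.
class LatencyProbe : public rclcpp::Node
{
public:
  LatencyProbe() : Node("latency_probe")
  {
    pub_ = create_publisher<std_msgs::msg::Header>("latency_topic", 10);
    sub_ = create_subscription<std_msgs::msg::Header>(
      "latency_topic", 10,
      [this](const std_msgs::msg::Header::SharedPtr msg) {
        const auto latency = now() - rclcpp::Time(msg->stamp);
        RCLCPP_INFO(get_logger(), "latency: %.1f us", latency.seconds() * 1e6);
      });
    // 1 ms period, matching the "1000 messages in one second" setup.
    timer_ = create_wall_timer(1ms, [this]() {
        std_msgs::msg::Header msg;
        msg.stamp = now();
        pub_->publish(msg);
      });
  }

private:
  rclcpp::Publisher<std_msgs::msg::Header>::SharedPtr pub_;
  rclcpp::Subscription<std_msgs::msg::Header>::SharedPtr sub_;
  rclcpp::TimerBase::SharedPtr timer_;
};

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  rclcpp::spin(std::make_shared<LatencyProbe>());
  rclcpp::shutdown();
  return 0;
}
```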

As you can see, for large data the latency improves by around 10x, and the throughput increases substantially, by around 50%.

And now, inter-process (two processes):

The results are similar: around 10x lower latency, and a substantial throughput increase of around 50%.


Congratulations, looks great! The intra-process and inter-process latencies do not differ that much for the 2.1.0_sync mode. So Fast DDS seems to use the same data path no matter whether it is intra- or inter-process, right?

Is there a need to copy memory for intra-process communication in sync mode at all or wouldn’t it make sense to just “forward the allocated memory” to the connected subscriber? Maybe that doesn’t matter for the next “zero copy” release anymore.

That looks great, thanks. Also looking forward to hearing about the zero-copy improvements that are coming.

Are there any more improvements in the pipeline for intra-process communication? Especially addressing lock-free publication/subscription.

We’ve been looking into different alternatives for publishing data in a real-time system at 2 kHz. Across all the DDS implementations we tried, the fastest we could get was 0.01 ms latency for small packets of data (256 bytes), which was 3x as much as using a lock-free circular buffer or an RT-focused system like Orocos.

We understand going below 0.01 ms might not be a typical use case for DDS, but having lock-free and deterministic publication and subscription times would open up the possibility of publishing directly from RT threads without affecting the determinism of the rest of the loop.
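For context, the kind of single-producer/single-consumer structure we were comparing against looks roughly like this (an illustrative sketch, not the exact code we used):

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// Illustrative single-producer/single-consumer lock-free ring buffer,
// the kind of baseline used in the comparison above.
template <typename T, std::size_t Capacity>
class SpscRingBuffer
{
public:
  // Returns false if the buffer is full; the producer never blocks.
  bool try_push(const T & value)
  {
    const auto head = head_.load(std::memory_order_relaxed);
    const auto next = (head + 1) % Capacity;
    if (next == tail_.load(std::memory_order_acquire)) {
      return false;  // full
    }
    buffer_[head] = value;
    head_.store(next, std::memory_order_release);
    return true;
  }

  // Returns an empty optional if there is nothing to read; the consumer never blocks.
  std::optional<T> try_pop()
  {
    const auto tail = tail_.load(std::memory_order_relaxed);
    if (tail == head_.load(std::memory_order_acquire)) {
      return std::nullopt;  // empty
    }
    T value = buffer_[tail];
    tail_.store((tail + 1) % Capacity, std::memory_order_release);
    return value;
  }

private:
  std::array<T, Capacity> buffer_{};
  std::atomic<std::size_t> head_{0};  // written only by the producer
  std::atomic<std::size_t> tail_{0};  // written only by the consumer
};
```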

Thank you for your compliments, @rex-schilasky.

The intra-process and inter-process latencies do not differ that much for the 2.1.0_sync mode

It may not be clear in the graphs, but the results do show a substantial difference between intra-process and inter-process. For the last dot on the lines, which corresponds to the PointCloud8m data type, the latencies compare at 2.65 vs 3.68. But the important thing is that, apart from the slightly better latency, the number of samples transmitted compares at 730 vs 208.

Is there a need to copy memory for intra-process communication in sync mode at all or wouldn’t it make sense to just “forward the allocated memory” to the connected subscriber?

This is exactly what 2.1.0 brings in. When the publisher and the subscriber are in the same process, payloads are not copied; reference counts are updated instead.
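Conceptually, the same-process path behaves as if the payload were a reference-counted buffer shared between the publisher and subscriber histories. The following is only a simplified illustration of that idea, not the actual Fast DDS data structures:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// The "payload" is a reference-counted, immutable buffer. Delivering a sample
// within the same process only bumps the reference count; the bytes are never copied.
using Payload = std::shared_ptr<const std::vector<std::byte>>;

struct PublisherHistory { std::vector<Payload> samples; };
struct SubscriberHistory { std::vector<Payload> samples; };

void deliver_same_process(const Payload & payload, SubscriberHistory & sub)
{
  // No memcpy of the payload bytes: only the shared_ptr control block is touched.
  sub.samples.push_back(payload);
}

int main()
{
  PublisherHistory pub;
  SubscriberHistory sub;

  // An 8 MB payload, roughly the size of the PointCloud8m samples above.
  auto payload = std::make_shared<const std::vector<std::byte>>(8 * 1024 * 1024);
  pub.samples.push_back(payload);

  deliver_same_process(payload, sub);

  // Publisher history, subscriber history and the local variable share one buffer.
  assert(payload.use_count() == 3);
  return 0;
}
```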

Maybe that doesn’t matter for the next “zero copy” release anymore.

The zero-copy mechanism mainly involves API extensions allowing the user to get pointers into the payload buffers (both on the publication and the subscription side). It will only be available for POD types though, as the buffer returned would be the serialized payload buffer, in order to avoid (de)serialization of the data type.

With these new API extensions in place, we would have intra-process zero-copy available. We are also working on an inter-process data-sharing mechanism, in which the payloads will be created on a shared memory segment / mapped file. The combination of these two mechanisms will provide inter-process zero-copy.
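From the ROS 2 side, this kind of loan would surface through rclcpp’s loaned-message API. A rough sketch of how a user could take advantage of it, assuming the RMW reports loan support and a plain-old-data message type is used:

```cpp
#include <memory>
#include <utility>

#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/float64.hpp"  // a fixed-size (POD-style) message

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("loaned_publisher");
  auto pub = node->create_publisher<std_msgs::msg::Float64>("loaned_topic", 10);

  if (pub->can_loan_messages()) {
    // The middleware hands out a buffer it owns; we write the sample in place
    // and hand it back without any user-side copy.
    auto loaned = pub->borrow_loaned_message();
    loaned.get().data = 42.0;
    pub->publish(std::move(loaned));
  } else {
    // Fall back to the regular (copying) path if the RMW does not support loans.
    std_msgs::msg::Float64 msg;
    msg.data = 42.0;
    pub->publish(msg);
  }

  rclcpp::shutdown();
  return 0;
}
```

Whether such a loan actually avoids a copy depends on the RMW implementation and the data type; the sketch only shows the shape of the API, not a guarantee of zero-copy behaviour.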


Please see my answer above regarding the zero-copy improvements.

Are there any more improvements in the pipeline for intra-process communication? Especially addressing lock-free publication/subscription.

When designing improvements for intra-process, we also have to account for remote endpoints. The publisher has to be prepared to have subscriptions in the same process, in another process on the same host, and on a different host. It also needs to take care of multiple threads using the same publisher, not only user threads, but also the asynchronous sender thread. This implies some synchronization requirements, as you can understand.

But if you only have one publisher and one subscriber for a topic, and both of them are in the same process, I think it could be possible to get there, though that would depend mostly on the OS (i.e., the timing of mutex operations when blocking is not necessary).

Yeah, I understand the wide array of use cases you need to support, and obviously lock-free is not an option for all of them. I didn’t mention it, but I was thinking about single-publisher/single-subscriber intra-process communication.

I believe you are doing amazing work, and it’s great that you keep pushing forward both in features and in code quality, as seen in today’s other post.

@MiguelCompany thank you for your nice explanations. I will track the further development with my team.