This week we announced our new Fast DDS 2.1 release for Foxy, and I promised some data about one of the new features: the improved intra-process & inter-process performance.
Fast DDS was already really fast, but over the last months we have been making some modifications in preparation for the upcoming Zero Copy shared memory transport planned for the end of this year. As a result, Fast DDS 2.1 is now strikingly fast, and the larger the data, the bigger the improvement.
Here you can see several graphs showing latency and throughput for RMW_FastRTPS, in both single-process and two-process configurations, comparing the Sync and Async alternatives against the prior version of Fast DDS.
In the next graph, we try to send 1000 messages of different sizes in one second, to measure the end-to-end processing capacity and determine how many messages per second RMW Fast RTPS is able to handle.
Congratulations, this looks great! The intra-process and inter-process latencies do not differ that much for the 2.1.0_sync mode. So Fast DDS seems to use the same data path regardless of whether it is intra- or inter-process, right?
Is there a need to copy memory for intra-process communication in sync mode at all or wouldn’t it make sense to just “forward the allocated memory” to the connected subscriber? Maybe that doesn’t matter for the next “zero copy” release anymore.
That looks great, thanks. Also looking forward to hearing about the zero-copy improvements that are coming.
Are there any more improvements in the pipeline for intra-process communication? Especially addressing lock-free publication/subscription.
We’ve been looking into different alternatives for publishing data in a real-time system at 2 kHz, and of all the DDS implementations we tried, the fastest we could get was 0.01 ms latency for small packets of data (256 bytes), which was 3x as much as using a lock-free circular buffer or an RT-focused system like Orocos.
We understand going below 0.01 ms might not be a typical use case for DDS, but having lock-free and deterministic publication and subscription times would open up the possibility of publishing directly from RT threads without affecting the determinism of the rest of the loop.
The intra-process and inter-process latencies do not differ that much for the 2.1.0_sync mode
It may not be clear in the graphs, but the results do show a substantial difference between intra-process and inter-process. For the last dot in the lines, which corresponds to the PointCloud8m data type, latency compares 2.65 vs 3.68. But the important thing is that, apart from the slightly better latency, the number of samples transmitted compares 730 vs 208.
Is there a need to copy memory for intra-process communication in sync mode at all or wouldn’t it make sense to just “forward the allocated memory” to the connected subscriber?
This is exactly what 2.1.0 is bringing in. When the publisher and the subscriber are in the same process, payloads are not copied; reference counts are updated instead.
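To illustrate the idea only (a minimal model, not the actual Fast DDS internals): same-process delivery can hand every local subscriber a reference to the single payload buffer instead of a copy.

```cpp
// Illustrative sketch of reference-counted payload sharing for same-process
// delivery. This is NOT Fast DDS code, just a minimal model of the concept.
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

struct SerializedPayload
{
    std::vector<std::uint8_t> data;  // serialized sample bytes
};

class IntraProcessSubscriber
{
public:
    // Store another reference to the same payload buffer; no bytes are
    // copied, only the shared_ptr reference count is updated.
    void deliver(std::shared_ptr<const SerializedPayload> payload)
    {
        history_.push_back(std::move(payload));
    }

private:
    std::vector<std::shared_ptr<const SerializedPayload>> history_;
};

class IntraProcessPublisher
{
public:
    void add_subscriber(IntraProcessSubscriber * sub)
    {
        subscribers_.push_back(sub);
    }

    void publish(std::vector<std::uint8_t> bytes)
    {
        // One allocation per sample; every same-process subscriber just bumps
        // the reference count instead of copying the data.
        auto payload = std::make_shared<SerializedPayload>();
        payload->data = std::move(bytes);
        for (auto * sub : subscribers_) {
            sub->deliver(payload);
        }
    }

private:
    std::vector<IntraProcessSubscriber *> subscribers_;
};

int main()
{
    IntraProcessPublisher pub;
    IntraProcessSubscriber sub;
    pub.add_subscriber(&sub);
    pub.publish(std::vector<std::uint8_t>(1024 * 1024, 0));  // 1 MB sample, shared, not copied
    return 0;
}
```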
Maybe that doesn’t matter for the next “zero copy” release anymore.
The zero-copy mechanism mainly involves API extensions allowing the user to get pointers into the payload buffers (both on the publication and the subscription side). It will only be available for POD types though, as the returned pointer would point directly into the serialized payload buffer, in order to avoid (de)serialization of the data type.
With these new API extensions in place, we would have intra-process zero-copy available. We are also working on an inter-process data-sharing mechanism, in which the payloads will be created in a shared memory segment / mapped file. The combination of these two mechanisms will provide inter-process zero-copy.
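As a rough sketch of what such a loan-based API could look like (the type and method names below are hypothetical, not the final Fast DDS API):

```cpp
// Hypothetical sketch of a loan-based, zero-copy publication API.
// Names and signatures are illustrative only; they are not the Fast DDS API.
#include <cstdint>

// A plain-old-data (POD) type: no pointers, strings or containers, so its
// in-memory layout matches the serialized payload and no (de)serialization
// is required.
struct LaserScanPOD
{
    std::uint64_t stamp_ns;
    float ranges[1024];
};

class ZeroCopyWriter
{
public:
    // Hands out a pointer directly into a payload buffer owned by the
    // middleware. For the inter-process data-sharing mechanism this buffer
    // would live in a shared memory segment / mapped file.
    LaserScanPOD * loan_sample()
    {
        return &pool_slot_;  // stand-in for a slot from a pre-allocated pool
    }

    // "Publishes" the loaned buffer: a real implementation would exchange
    // only metadata pointing at the buffer, never copying the payload bytes.
    void write(LaserScanPOD * /*sample*/) {}

private:
    LaserScanPOD pool_slot_{};
};

int main()
{
    ZeroCopyWriter writer;

    LaserScanPOD * sample = writer.loan_sample();  // borrow middleware memory
    sample->stamp_ns = 0;
    sample->ranges[0] = 1.5f;                      // fill the sample in place
    writer.write(sample);                          // hand it back, zero copies
    return 0;
}
```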
Please see my answer above regarding the zero-copy improvements.
Are there any more improvements in the pipeline for intra-process communication? Especially addressing lock-free publication/subscription.
When designing improvements for intra-process communication, we also have to account for remote endpoints. The publisher has to be prepared to have subscriptions in the same process, in another process on the same host, and on a different host. It also needs to take care of multiple threads using the same publisher, not only user threads but also the asynchronous sender thread. This implies some synchronization requirements, as you can understand.
But if you only have one publisher and one subscriber for a topic, and both of them are in the same process, I think it could be possible to get there, though that would depend mostly on the OS (i.e. the timing of mutex operations when blocking is not necessary).
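For that single-publisher / single-subscriber, same-process case, the classic building block would be a lock-free SPSC ring buffer; a minimal sketch (illustrative only, not Fast DDS code) could look like this:

```cpp
// Minimal single-producer / single-consumer (SPSC) lock-free ring buffer.
// With exactly one publisher thread and one subscriber thread, atomics with
// acquire/release ordering are enough; no mutex is taken on either path.
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

template <typename T, std::size_t Capacity>
class SpscRingBuffer
{
public:
    // Called only from the publisher thread.
    bool try_push(const T & value)
    {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) % Capacity;
        if (next == tail_.load(std::memory_order_acquire)) {
            return false;  // full: the publisher stays wait-free, no blocking
        }
        buffer_[head] = value;
        head_.store(next, std::memory_order_release);
        return true;
    }

    // Called only from the subscriber thread.
    std::optional<T> try_pop()
    {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) {
            return std::nullopt;  // empty
        }
        T value = buffer_[tail];
        tail_.store((tail + 1) % Capacity, std::memory_order_release);
        return value;
    }

private:
    std::array<T, Capacity> buffer_{};
    std::atomic<std::size_t> head_{0};  // next write slot
    std::atomic<std::size_t> tail_{0};  // next read slot
};

int main()
{
    SpscRingBuffer<int, 8> queue;
    queue.try_push(42);            // publisher-thread side
    auto value = queue.try_pop();  // subscriber-thread side
    return value.value_or(0) == 42 ? 0 : 1;
}
```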
Yeah, I understand the wide array of use cases you need to support, and obviously lock-free is not an option for all of them. I didn’t mention it, but I was thinking about single-publisher/single-subscriber intra-process communication.
I believe you are doing amazing work, and it’s great that you keep pushing forward both in features and in code quality, as seen in today’s other post.
Just checked it, looks good. Currently I do a lot of performance measurements using the APEX AI performance test suite to figure out how much performance is lost in the RMW layer (serialization / deserialization / data queuing …). I set up a fresh Ubuntu 20.04 machine with the latest ROS 2 Foxy and stretch all RMWs as much as I can (different message sizes, message types, data rates, pub/sub combinations). I tried to set up the same on a Windows 10 machine, but could not get the performance suite to run because of native Linux dependencies (rusage).
First tests show that the low-level “zero copy or not” memory transfer is not that relevant for smaller messages up to 256 kB. Starting from 1 MB it gets interesting, but I’m still in the middle of the evaluation.
In my experience (with other middlewares) this cut-off is related to cache size(s) of the CPU(s) used.
Not sure whether this directly translates to the DDS and RMW implementations you are testing, but it could be interesting to see whether you see a similar (cor)relation.
@gavanderhoorn, you are right, the size of the L2 cache seems to be one performance factor. My machine has 1.5 MB of L2 cache, and for messages larger than that size the performance of some RMWs decreases non-linearly (latency increases). All tests run on the same host - it’s all inter-process communication.
The results are sorted per experiment by RMW_IMPLEMENTATION and MessageType. The parameters for every experiment are NUM_PUB, NUM_SUB, DATA_RATE and MAX_DURATION; plots are generated for every single run and stored beside the log file as a PDF (perfplot).
If I run the batch for all message types (19) over the 3 current RMWs for 120 s each, it takes around 2 hours per experiment. So I really should configure everything correctly before comparing the results :-). Currently I have not tuned anything (XML profiles …).
The eCAL RMW and Cyclone DDS RMW are installed from the latest public release. The Fast DDS (Fast RTPS) RMW is used as it is shipped with Foxy. I’m open to any kind of suggestion about which RMW to add, how to configure it, and which version to use.
The first thing you have to take into account is that Fast RTPS in ROS 2 publishes asynchronously by default. As the graphs show, latency is better if you select the synchronous behavior. There is a post explaining how to change it here.
In the meantime I started a bunch of script-driven tests and sorted all results by RMW implementation, message type, pub/sub connections and data rates. For this first run I did not change any QoS or other parameters for the RMWs, in order to have a starting point. So the QoS HISTORY_KIND = KEEP_ALL is for sure problematic in the case of large messages published at high data rates to multiple subscribers. It’s all simply put on GitHub for now.
For now the results have not been analyzed in detail; at a first rough view, the performance for small, less structured messages is in the same range for all RMWs. Differences can be seen for larger messages like PointCloudXY or Array2m.
Another result of this first run is that the IPC performance of the underlying mechanism is not that important compared to the performance of the RMW. The time for serializing and deserializing the ROS message types is the main communication bottleneck. For eCAL I made measurements on the same machine showing that a raw 1 MB payload can be exchanged in about 150 µs, while exchanging an Array1m ROS message using the current rmw_ecal takes around 1 ms (the other RMWs are slightly faster, unfortunately ;-).
I would like to rerun all tests with an optimized profile (QoS HISTORY_KIND KEEP_LAST (depth = 1) …); later on I need to write some Python scripts to bring all results into one view for better comparison. Additionally, I will try to bring in rmw_iceoryx and check how it performs in the various experiments.
I’m open to any suggestion on how to set up the RMWs in an optimal way.
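For reference, requesting KEEP_LAST history with depth 1 from a ROS 2 node looks roughly like this (a minimal rclcpp sketch; the topic name and message type are just placeholders):

```cpp
// Minimal rclcpp example requesting KEEP_LAST history with depth 1,
// instead of the KEEP_ALL history used in the first test run.
#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

int main(int argc, char ** argv)
{
    rclcpp::init(argc, argv);
    auto node = rclcpp::Node::make_shared("qos_example");

    // KEEP_LAST with depth 1: only the newest sample is kept per endpoint,
    // which avoids unbounded history growth with large, high-rate messages.
    rclcpp::QoS qos(rclcpp::KeepLast(1));

    auto pub = node->create_publisher<std_msgs::msg::String>("chatter", qos);
    auto sub = node->create_subscription<std_msgs::msg::String>(
        "chatter", qos,
        [](const std_msgs::msg::String::SharedPtr msg) {
            RCLCPP_INFO(rclcpp::get_logger("qos_example"), "received: %s",
                        msg->data.c_str());
        });

    rclcpp::spin(node);
    rclcpp::shutdown();
    return 0;
}
```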