DDS implementation performance benchmark

Hello DDS fans!
Today, we have a very interesting study to share with you. A DDS middleware performance comparison!

As you may know, we recently pioneered a zero-copy implementation in our latest release, Fast DDS v2.2.0, which also improves general performance. Curious? Check out our complete latency and throughput tests with Fast DDS, here.

At the same time, we were interested in seeing our performance in comparison with the newly voted default DDS middleware implementation for Galactic, Cyclone DDS.

The performance benchmark tests include latency results with both intraprocess and interprocess communications, and throughput with interprocess communication. Let the graphics speak for themselves.

Latency: intra-process

Fast DDS v2.2.0 - CycloneDDS 0.7.0 (Intraprocess)

Latency: inter-process

Fast DDS v2.2.0 - CycloneDDS 0.7.0 (Interprocess)

Throughput: inter-process

Fast DDS v2.2.0 - Cyclone 0.7.0 comparison

Fast DDS v2.2.0 - Cyclone 0.7.0 comparison (1)
Fast DDS v2.2.0 (zero-copy) is excluded here so that the performance of the remaining DDS implementations is visible.

Conclusion

The results are clear: Fast DDS v2.2.0 is the highest performing open source DDS implementation!
It performs considerably better in latency and throughput, and allows personalized configuration.

Eliminating unnecessary data copies from buffer to buffer is worth it!

For the complete latency and throughput performance benchmarks, and in order to reproduce the results, check out the article “Fast DDS vs Cyclone DDS performance”.


Thank you! Could you check the “Throughput: inter-process” result? I see only 2 sets of data in the graph, while the legend has 7 of them.

Is the inter-process throughput y-limit really 70 Tbps?

Or should I read the graph as “we stored an 8 MB payload in memory and then passed a pointer to it to a subscriber, so we transferred 8 MB in (more or less) a single CPU instruction”? That would be a really unfortunate interpretation.

Hi @yjkim046 ,

I’m afraid the graph is correct. The 7 series are there, but 6 of them are not very visible because of how far they are from the zero-copy one. The second throughput graph shows the same data without the zero-copy series, so the other 6 are actually visible. For the more trained eye, I’ve made a chart from the same data containing all the series on a log-log scale.

Fast DDS v2.2.0 - Cyclone 0.7.0 comparison (2)

Hi @peci1; zero-copy here means that the user loans a sample from the DataWriter to put their data into it. Then, Fast DDS uses its data-sharing mechanism to share the data with the DataReader, which in turn lets the user read directly from the buffer (loaning the sample), instead of handing over a copy of the data. This way, Fast DDS does not perform a single copy of the buffer that the user populates. So in a nutshell, yes, the DataReader gets a pointer to the data from the DataWriter history. I’m not sure I follow why that is unfortunate. Throughput here is simply the amount of data that the middleware is able to hand to the DataReader’s user per time unit.
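The loaning flow described above can be pictured with a small toy model. This is only a sketch in plain Python, not the Fast DDS API: `ToyWriter`, `ToyReader`, and their method names are hypothetical, chosen to mirror the loan, write, take, and return steps.

```python
class ToyWriter:
    """Hypothetical stand-in for a DataWriter that owns a pool of sample buffers."""

    def __init__(self, sample_size, depth=4):
        self._pool = [bytearray(sample_size) for _ in range(depth)]
        self._free = list(range(depth))  # indices of buffers available for loan
        self._history = []               # indices of written samples, oldest first

    def loan_sample(self):
        # Hand the user a view onto a pool buffer; no payload is copied.
        idx = self._free.pop()
        return idx, memoryview(self._pool[idx])

    def write(self, idx):
        # "Publishing" only records which buffer holds the sample.
        self._history.append(idx)


class ToyReader:
    """Hypothetical stand-in for a DataReader sharing the writer's buffers."""

    def __init__(self, writer):
        self._writer = writer

    def take(self):
        # The reader receives a view onto the same buffer the user populated.
        idx = self._writer._history.pop(0)
        return idx, memoryview(self._writer._pool[idx])

    def return_loan(self, idx):
        # Give the buffer back so it can be loaned again.
        self._writer._free.append(idx)


writer = ToyWriter(sample_size=8)
reader = ToyReader(writer)

idx, buf = writer.loan_sample()
buf[:5] = b"hello"            # the user fills the loaned buffer in place
writer.write(idx)

ridx, view = reader.take()
same_memory = view.obj is buf.obj   # both views wrap the same bytearray
payload = bytes(view[:5])           # copying out is the user's choice, not the middleware's
reader.return_loan(ridx)
```

In this toy, what travels from writer to reader is a buffer index, which is why the cost of the hand-off does not grow with the payload size.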

Thanks for the clarification, though it’s still not clear to me. Do the test times include creation of the data to be sent (and what are the data: pseudorandom sequences? zeros?)? Do the times include (de)serialization, or is it just the time to transfer some prepared buffers? Do the times include copying/writing the data to be sent into the loaned buffer? I ask because I can’t figure out how you would achieve 70 Tbps if you had to write all the data to RAM at least once; the fastest RAM modules available allow for speeds around 300 Gbps…
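The arithmetic behind this question can be sketched as follows. The 70 Tbps and 300 Gbps figures are taken from the posts above, and the 8 MB payload is the size assumed in the earlier question; none of them are measured here.

```python
def handoff_time_s(payload_bytes, throughput_bps):
    """Time per sample implied by a given throughput, in seconds."""
    return payload_bytes * 8 / throughput_bps


PAYLOAD_BYTES = 8 * 1024 * 1024  # 8 MB payload (assumed from the question)

# Per-sample time implied by a 70 Tbps reading on the graph.
t_graph = handoff_time_s(PAYLOAD_BYTES, 70e12)

# Per-sample time if every byte had to cross a 300 Gbps memory link once.
t_ram = handoff_time_s(PAYLOAD_BYTES, 300e9)

print(f"implied by graph: {t_graph * 1e6:.2f} us per sample")
print(f"one pass through RAM: {t_ram * 1e6:.0f} us per sample")
```

Under these assumptions the graph implies just under one microsecond per 8 MB sample, against a couple of hundred microseconds for a single pass through RAM, so the figure is only plausible if the payload itself is not moved inside the measured interval.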

All the tests are measuring from right before the call to write to right after the call to take. As you say, the time it takes to prepare the sample is not included in any of the measurements, since that overhead is considered to occur outside the middlewares. Including that overhead can in fact be a different and interesting experiment, but it is not the intention of these ones, where the capabilities of just the middleware are addressed.
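As a rough illustration of that measurement window, here is a minimal sketch with a toy in-process channel. The `deque` stands in for the middleware and `measure_write_to_take` is a hypothetical helper; the real tests run writer and reader as separate DDS entities, which this does not reproduce.

```python
import time
from collections import deque


def measure_write_to_take(channel_write, channel_take, sample):
    """Time only the span from just before write to just after take."""
    start = time.perf_counter_ns()
    channel_write(sample)
    received = channel_take()
    end = time.perf_counter_ns()
    return end - start, received


queue = deque()
sample = bytes(1024)  # sample preparation happens before the timer starts

elapsed_ns, received = measure_write_to_take(queue.append, queue.popleft, sample)
```

Anything done to build `sample` falls outside the timed region, which matches the point that sample preparation is treated as overhead outside the middleware.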

Hi all,

It is important to highlight that the shared memory transport and the zero-copy feature are fully integrated into Fast DDS, supported on all the ROS 2 supported platforms, and transparent to the end user.

There are other PoCs out there in other open-source DDS implementations, but the only one really available and ready to use is the one we are shipping in Fast DDS. The zero-copy feature was a fundamental milestone in our roadmap.


Thanks for the great work guys.

I can imagine the answer, but are these zero-copy features real-time safe as well? If not, are there any plans to make them so?

Back in November I did a comparison of a few DDS implementations, Orocos, and a plain lock-free buffer for intra-process communication. Our goal was to find a way to pass messages in a real-time-safe way, which typically means lock-free, with no memory allocation and no condition variables.

I was measuring max latencies on my machine with an RT Preempt patch. Orocos and the buffer were at 3.5 and 2.5 microseconds respectively, and Fast DDS was at 11 microseconds (which was the best DDS implementation at that time).

This was before all your recent improvements, and at some point in the future I’ll repeat them, but I was curious because most of the latency issues were due to non-RT safe operations.

Our control cycle is 0.5 milliseconds, or 500 microseconds, so for us an 8 microsecond difference per publish is quite significant. Obviously I’m not expecting to get down to the plain lock-free buffer latencies, but something closer to it is possible, as seen with Orocos.

I’ve seen the latency measurements in your other post, but I can’t directly compare them to my results due to different hardware, message sizes and system load. RT safety is what will be the determining factor in this case, and it will greatly impact max latencies but not necessarily mean latency.
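To put that difference in proportion, here is a small sketch of the cycle budget. The 500 microsecond cycle and the 8 microsecond difference come from the post above; the count of 10 publishes per cycle is a made-up example, not a figure from our robots.

```python
CYCLE_US = 500.0  # control cycle from the post: 0.5 ms


def budget_fraction(extra_us_per_publish, publishes_per_cycle, cycle_us=CYCLE_US):
    """Share of one control cycle spent on the extra publish latency."""
    return extra_us_per_publish * publishes_per_cycle / cycle_us


rate_hz = 1e6 / CYCLE_US                 # cycles per second
one_publish = budget_fraction(8.0, 1)    # a single publish per cycle
ten_publishes = budget_fraction(8.0, 10) # hypothetical: ten publishes per cycle

print(f"{rate_hz:.0f} Hz, {one_publish:.1%} per publish, {ten_publishes:.1%} for ten")
```

The per-publish cost looks small in isolation but scales linearly with the number of topics published inside the loop.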

Edit: Also taking the opportunity to ask you for some feedback on the issue I opened: Restarting nodes causes other nodes to "disappear" · Issue #509 · ros2/rmw_fastrtps · GitHub


Hi @v-lopez ,

Thanks for the involvement!

I’m guessing you’re imagining correctly. We can and we do avoid memory allocation (except on start-up), but we do have both lock mechanisms and condition variables in place. Basically, what we have done is add sample loaning so zero-copy can be achieved. However, the synchronization mechanisms remain the same. We do offer some RT features: allocation-free operation (as I said earlier), and a configurable max blocking time for blocking calls. Sadly, there are presently no plans for alternative lock-free mechanisms.

It would be great to see new benchmarks with all the new improvements, including zero-copy. As you’ve seen, our performance has improved significantly. I don’t know if you have some hard requirement on max latency, but the current Fast DDS implementation may be enough for you already. I’d be happy to give a hand with setting up the benchmark and interpreting the results!

I think these types of cases should be used to impose performance requirements on ROS 2 tier 1 middlewares in general. Of course being the most performant implementation is essential, but I also think it’s very relevant to know what the community expects of ROS 2. This way, we can make crucial decisions based on actual users’ needs. Maybe we should bring this topic to the Middleware WG. I’d say that running a control loop at 2000 Hz is pretty demanding!

We’ve been pretty focused on the Galactic API freeze, but this has not gone unnoticed. We will put some time into it this week!

Thanks for the thorough response @EduPonz.

Indeed, communicating at 2000 Hz is not a common scenario for most users, but for some of our robots it is. This communication doesn’t need to go through ROS 2; in fact, in ROS 1 it does not.

But having the possibility of doing it through ROS 2 would allow us to reuse all the existing tooling for it: we could record rosbags, introspect with command line tools and plot it live without changing a single line of code.

Our requirements are that all the operations in a control cycle must be done in 0.5 milliseconds consistently, as missing an update loop can have significant consequences.
That’s why determinism is even more important than performance. For us, something that averages 4 microseconds but never exceeds 5 microseconds is better than something that averages 2 microseconds with 15 microsecond peaks.
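A minimal sketch of that comparison, using made-up latency samples that only reproduce the averages and peaks quoted above (they are not measurements), and a hypothetical `meets_deadline` helper:

```python
from statistics import mean


def meets_deadline(latencies_us, max_allowed_us):
    """A transport is acceptable only if its worst case fits the bound."""
    return max(latencies_us) <= max_allowed_us


# Illustrative samples in microseconds, invented to match the figures in the post.
steady = [4.0, 4.0, 3.5, 4.5, 3.0, 5.0]  # averages 4 us, never exceeds 5 us
spiky = [1.0] * 13 + [15.0]               # averages 2 us, with a 15 us peak

for name, samples in (("steady", steady), ("spiky", spiky)):
    print(name, mean(samples), max(samples), meets_deadline(samples, 5.0))
```

Ranking by mean latency would pick the spiky transport, while ranking by worst case picks the steady one, which is the criterion that matters for a hard control deadline.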

We’re starting with the migration of our less demanding robots, and keeping an eye on how to implement the high-frequency control loop in the future. I’ll get back to you later for a second round of benchmarking.

Looking forward to it!

Regarding the tooling, we are doing some work on our side on that. In the meantime, I’ll share some instructions on how to use rosbag2 with Fast DDS that you might find helpful.

Hi, can you please share the details of the testing, step by step? I want to reproduce it in my environment.