New Fast DDS Performance Testing

Hi ROS community!

We’re happy to present the complete 2022 performance test of Fast DDS!

The following tests show the latency and throughput of Fast DDS, the default middleware of the latest ROS 2 release, Humble, compared with other open-source DDS implementations such as Cyclone DDS and OpenDDS.

The benchmark results, obtained with the Apex.ai performance_test tool, show that Fast DDS is a very stable performer, beating Cyclone DDS and OpenDDS in both intra-process and inter-process latency, and performing similarly in throughput, as the Apex.ai tests do not push the DDS implementations to their throughput limits.
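
To give a feeling for the setup, the measurements are driven by the Apex.ai performance_test tool roughly as sketched below; the exact option names vary between performance_test releases (check perf_test --help), and the values shown are illustrative rather than the precise configuration of these runs:

# Illustrative sketch only; option names differ between performance_test releases.

# Select the middleware under test (rmw_fastrtps_cpp is the Humble default).
export RMW_IMPLEMENTATION=rmw_fastrtps_cpp

# One publisher / one subscriber, 1 KiB array messages at 1 kHz for 30 s,
# logging per-sample latency and throughput for later plotting.
ros2 run performance_test perf_test \
  -c rclcpp-single-threaded-executor \
  --msg Array1k \
  --rate 1000 \
  --max-runtime 30 \
  -l fastdds_array1k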

You can find more information on the test results and configuration environment in the following links:

LATENCY INTRA-PROCESS DELIVERY
Fig 1.1a Latency: Fast DDS 2.8.0 intraprocess vs Cyclone DDS 0.9.0b1 intraprocess & OpenDDS 3.13.2 intraprocess

Fig 1.1b Latency: Fast DDS 2.8.0 intraprocess vs Cyclone DDS 0.9.0b1 intraprocess & OpenDDS 3.13.2 intraprocess, up to 64 KB

LATENCY INTER-PROCESS DELIVERY
Fig 1.2a Latency: Fast DDS 2.8.0 interprocess vs Cyclone DDS 0.9.0b1 interprocess & OpenDDS 3.13.2 interprocess

Fig 1.2b Latency: Fast DDS 2.8.0 interprocess vs Cyclone DDS 0.9.0b1 interprocess & OpenDDS 3.13.2 interprocess, up to 64 KB

THROUGHPUT INTRA-PROCESS DELIVERY
Fig 2.1a Throughput: Fast DDS 2.8.0 intraprocess vs Cyclone DDS 0.9.0b1 intraprocess & OpenDDS 3.13.2 intraprocess

THROUGHPUT INTER-PROCESS DELIVERY
Fig 2.1b Throughput: Fast DDS 2.8.0 interprocess vs Cyclone DDS 0.9.0b1 interprocess & OpenDDS 3.13.2 interprocess

5 Likes

@Daniel_Cabezas kudos for this performance benchmarking effort.

A couple of comments on this statement:

  • performance_test is fantastic, but it'd be great if you could move to a Low-Overhead Framework for Real-Time Tracing of ROS 2. performance_test was not originally meant for real-time tracing or benchmarking. Robots are real-time systems, and to support these claims you should be making “the right measurements”.
    We’ve been working towards standardizing the benchmarking of ROS 2 graphs at https://github.com/ros-infrastructure/rep/pull/364. It’d be great if you could contribute to that document and to the RobotPerf project. Adding a few benchmarks that help reproduce these results would add lots of credibility.
  • Beyond the results, there are some serious performance issues with FastDDS. I’ve experienced this myself very recently while running performance benchmarking tests on simple graphs (totally unrelated to the underlying DDS implementation). See a1 benchmark, preliminary results on Intel x86_64 · Issue #2 · robotperf/benchmarks · GitHub. I encountered lots of issues running this simple benchmark (a two-node graph involving simple perception image transformations) with FastDDS. The errors read as follows:
[component_container-1] 2023-01-20 10:12:32.692 [RTPS_MSG_IN Error] (ID:281472840857824) Problem reserving CacheChange in reader: 01.0f.96.5b.c4.0d.51.71.01.00.00.00|0.0.2f.4 -> Function processDataMsg
[component_container-1] 2023-01-20 10:12:32.692 [RTPS_MSG_IN Error] (ID:281472840857824) Problem reserving CacheChange in reader: 01.0f.96.5b.c4.0d.51.71.01.00.00.00|0.0.1c.4 -> Function processDataMsg
[component_container-1] 2023-01-20 10:12:32.692 [RTPS_MSG_IN Error] (ID:281472840857824) Problem reserving CacheChange in reader: 01.0f.96.5b.c4.0d.51.71.01.00.00.00|0.0.2f.4 -> Function processDataMsg
...
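
A commonly suggested mitigation for this class of error, sketched below but untested on my side, is to point Fast DDS at an XML profile that lets it reallocate its history cache instead of failing once the preallocated pool is exhausted; the profile and environment variables are assumptions about a possible workaround, not a verified fix for this benchmark:

# Untested sketch of a possible workaround, not a verified fix.
cat > /tmp/fastdds_profiles.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
    <data_writer profile_name="realloc_writer" is_default_profile="true">
        <historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
    </data_writer>
    <data_reader profile_name="realloc_reader" is_default_profile="true">
        <historyMemoryPolicy>PREALLOCATED_WITH_REALLOC</historyMemoryPolicy>
    </data_reader>
</profiles>
EOF

# Make rmw_fastrtps load the profile for every node in the benchmark.
export FASTRTPS_DEFAULT_PROFILES_FILE=/tmp/fastdds_profiles.xml
export RMW_FASTRTPS_USE_QOS_FROM_XML=1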

CycloneDDS worked like a charm. Exact same setup. For this reason we’re (for now) sticking to CycloneDDS for RobotPerf benchmarks, but I’d love to rectify this and also consider FastDDS.
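
For reference, switching the middleware underneath the benchmark only takes an environment variable, so re-running the comparison is cheap (the package and launch file names below are placeholders for the actual RobotPerf benchmark):

# Install the alternative RMW once, then select it per run.
sudo apt install ros-humble-rmw-cyclonedds-cpp

export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp   # or rmw_fastrtps_cpp (Humble default)
ros2 launch my_benchmark_pkg a1_benchmark.launch.py   # placeholder names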

3 Likes

I’d be careful with the absolutism. As you show in Figure 1.1b, CycloneDDS delivers the best performance for data up to 64 KB (attentive readers will have noticed that this is true in all graphs). The reason for not performing as well for bigger data is therefore probably the configuration of the networking service and how this impacts fragmentation. As you can probably gather, with some changes to the configuration you would get Cyclone beating FastDDS also for bigger messages, as the stack has already shown its efficiency for smaller messages. I guess the default configuration was used for Cyclone DDS, right? That default configuration is a trade-off between good performance and reduced memory utilisation.
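
To give an idea of the kind of change I mean, the usual large-message tuning for Cyclone is to enlarge the kernel and socket receive buffers so that fragments are not dropped; the snippet below is only a sketch with illustrative values (element names have moved around a bit across Cyclone releases), not the exact settings behind this claim:

# Sketch with illustrative values, not the exact configuration of these tests.

# Let the kernel hand out larger receive buffers than the default.
sudo sysctl -w net.core.rmem_max=2147483647

# Ask Cyclone DDS for a larger socket receive buffer through its XML config.
cat > /tmp/cyclonedds.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<CycloneDDS>
  <Domain id="any">
    <Internal>
      <MinimumSocketReceiveBufferSize>10MB</MinimumSocketReceiveBufferSize>
    </Internal>
  </Domain>
</CycloneDDS>
EOF
export CYCLONEDDS_URI=file:///tmp/cyclonedds.xml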

Anyway, this raises another point, and also a reason why we never bother evaluating against other DDS vendors. Evaluations should be done by independent third parties, with support from the vendors. Additionally, a key principle in science is that of experiment repeatability and independent validation. If these two points are missing, then it is just white noise from my perspective. With Cyclone we know that the numbers we show are absolutely repeatable.

Finally, I’d like to recall that a drag car is kind of useless in real life. With Cyclone DDS we are more interested in building software that works in real systems.

7 Likes

Why is Cyclone with loans slower than without loans at the largest message size? Does that make any sense?

1 Like

That is indeed strange. @eboasson any thoughts?

Good question. Giving any response that has even the slightest whiff of a suggestion of an answer, without first reproducing it and studying what is actually happening, is fraught with peril. So I won’t do that.

Actually reproducing the results will be tricky anyway, because some relevant bits of the test setup have been left unspecified, like the build configuration and whether or not ROS 2 is involved. I am not familiar with the details of the Cyclone backend of the current version of performance_test and I don’t know which language binding of Cyclone it uses, but I do hope it has been updated since I did the initial port.

The information doesn’t tell us if Iceoryx is involved. If it is, the intra-process path switches from Cyclone’s native intra-process to Iceoryx’s intra-process path. The details of that data path then also depend a bit on the data types involved. If I were to hazard a guess why the loaned path is slower …

On another note: when I see a mean intra-process latency of 100µs and an inter-process one of 200µs on modern hardware for small messages, then I am nearly certain that we are really looking at how much time it takes a modern (Intel, in this case) CPU to hear an alarm ring, snooze it, finally wake up from a sleep state, drink some coffee, and only then check whether the alarm sounded for a reason. At which point it rushes through the work and, totally exhausted, falls asleep again. Rinse and repeat. It really takes almost 100µs to come out of a light sleep for these devices.

Just jiggling the mouse cursor while running this test would already change the numbers significantly; running nice bash -c 'for i in {0..7} ; do while true ; do : ; done & done' even more (beware of possible typos). The correct way, of course, is to fiddle with the power governance settings.
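
To be concrete, what I mean by fiddling with the power governance settings is something along these lines, a sketch assuming the cpupower utility from linux-tools; the exact commands depend on distribution and kernel:

# Keep the CPU out of its deep sleep states while measuring latency.
# This trades power and heat for repeatable numbers; revert it afterwards.

# Pin all cores to the 'performance' frequency governor.
sudo cpupower frequency-set -g performance

# Disable idle states with an exit latency above ~10 microseconds, so a core
# does not need ~100 µs to wake up before it can handle the next sample.
sudo cpupower idle-set -D 10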

Basically, that means these proudly presented graphs are all but worthless, except where the really large messages are concerned.

I also would like to link to our Humble middleware report. That actually contains the information needed to reproduce, including the data munging used.

1 Like

Hi @eboasson

There is a similar Fast DDS Humble Middleware Report and the corresponding Open Robotics ROS Middleware Evaluation Report.

Based on these reports, the ROS 2 TSC voted and selected Fast DDS as the default middleware for Humble.

Over the last few years there has been a race on performance. The Fast DDS team didn’t choose these metrics; personally, I would have preferred other criteria such as features and documentation, and I still think the same, but we took up the gauntlet and now performance is part of our DNA. That is why we run these performance tests (and more) every night in our CI, to compare our performance with previous commits and releases of Fast DDS and also with other DDS implementations.

I don’t want to start a flame war with you guys: the framework to reproduce these tests is public, and any user can run the same tests we are showing here.

2 Likes