ROS2 speed

I’m so embarrassed. I misread the README in your repo and thought the goal was 5,000 elements per message, but then I re-read it and saw 50,000 :rofl: so a few more games are required:

  • replace the array of builtin_interfaces/Time with primitive arrays of secs and nsecs. Otherwise it has to call the serialization/deserialization functions for that type.
  • set the UDP socket buffers much larger. I set them all to 64 MB just for fun.
  • pre-allocate the message outside the main loop, rather than allocating a new MsgType every time.
  • for Cyclone, set MaxMessageSize and FragmentSize to very large values, along with increasing WhcHigh, in a Cyclone DDS XML configuration file. These options are documented in the Cyclone DDS repository on GitHub (eclipse-cyclonedds/cyclonedds).
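
To illustrate the first bullet without any ROS dependency, here is a minimal Python sketch (not the code from the repo). It splits a list of (sec, nanosec) pairs into two parallel primitive arrays, which is the shape a message with primitive array fields would carry. The function name split_stamps is hypothetical, and the array type codes are chosen to mirror the int32 sec / uint32 nanosec fields of builtin_interfaces/Time.

```python
from array import array


def split_stamps(stamps):
    """Split (sec, nanosec) pairs into two parallel primitive arrays.

    'i' is a signed C int and 'I' an unsigned C int; both are 32 bits wide
    on common platforms, matching int32 sec / uint32 nanosec.
    """
    secs = array("i", (sec for sec, _ in stamps))
    nsecs = array("I", (nsec for _, nsec in stamps))
    return secs, nsecs


# Hypothetical input: three timestamps as (sec, nanosec) pairs.
stamps = [(10, 0), (10, 500_000), (11, 250)]
secs, nsecs = split_stamps(stamps)
print(list(secs), list(nsecs))
```

The point of this layout is that each array is a flat block of primitives, so a serializer can handle it as one contiguous run instead of invoking a per-element routine for a nested type.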

With those steps, it sends fine (no dropped messages) at 50,000 elements on FastDDS and Cyclone, and on Cyclone it goes up through 150,000 elements per message (at 1 kHz) without any drops. That is just about maxing out a core for each (publisher and subscriber) on my machine, so if you actually do anything to the data, you would start seeing dropped messages unless you’re careful with multi-threading. I didn’t spend time tuning FastDDS but I’m sure it could also go well beyond 50k elements with some tuning effort.

In general though, I think this is probably close to the limit of what DDS (UDP) can do. I’d suggest using the composition feature of ROS 2 as well as the new shared-memory features of both FastDDS and Cyclone (iceoryx), some of which may need fixed-size messages in order to work nicely. It’s a lot of data to throw around.

It might also help to set ROS_LOCALHOST_ONLY=1. For that to work, multicast has to be enabled on the loopback interface (if on Ubuntu 20.04).

While it seems there are some approaches which mitigate / solve the immediate problem reported by @Bernd_Pfrommer, this observation is pretty spot-on:

I don’t have an easy fix, and I fully believe I understand why things are the way they are in ROS 2 – and I also understand that making high-performance networking applications is still complex and needs lots of moving parts to do just the right things in the right way – but seeing what @codebot had to do to get to the performance of @Bernd_Pfrommer’s ‘naive’ ROS 1 implementation makes me sad.

It connects to the “ROS2 Default Behavior (Wifi)” thread and a few other threads we’ve had about the user experience of ROS 2 compared to ROS 1.

In this case a user actually reached out and asked what he could do better to improve things. In others, that never happens, and ROS 2 is discarded as not-ready, not performant enough or “even worse than ROS 1”.

(on the bright side: it’s consultancy heaven like this. People-in-the-know are in a very good position)


Edit: perhaps posts #19 and #21 by codebot in this thread (ROS2 speed) should be turned into a guide in the ROS 2 documentation? That could be a good start for a performance-oriented set of patterns and/or a ROS 2 tuning guide.

It doesn’t seem like there is a lack of knowledge on these subjects. It just seems to be hidden in developer minds and collective experience in organisations.

Apart from what has already been proposed, how about publishing the events directly, instead of a variable-length array of them? How would the example behave at a rate of 50 MHz with 16-byte messages?

Thanks for the support!

The link “ROS2 Default Behavior (Wifi)” was a very interesting one and I read many of the posts there, in particular because when filing the bug report against Cyclone DDS I came across @allenh1’s bug report about poor wifi performance, so this is apparently still an ongoing issue. Granted, for wifi problems it’s impossible to provide a simple reproducible setup because the hardware involved is very non-standard and highly configurable. Still, the fact that experienced ROS/ROS2 engineers struggle with getting reasonable transmission rates over wifi is troubling.
Another part of the problem: the ROS community is just too nice. I deliberated for a while whether I should actually post this under a thread that’s called “ROS2 speed”, although my problem was very similar to the OP’s. Are you going to search for “speed” or “slow” or “performance issues” when you have issues? Probably the latter two. In a similar vein, if you were seeing terrible wifi performance under ROS2, would you think that’s the focus of a thread called “ROS2 Default Behavior (Wifi)”? Probably not. I don’t mean to criticize either of the posters for being civilized, in fact they stand out for taking time to raise the issue in the first place.

Along those lines, here is a suggestion: clearly document ROS2 performance issues vs ROS1 and point to possible workarounds. It will not be a great advertisement for ROS2, but that is still better than people learning about it the hard way. I did not know how expensive the marshalling of messages is in ROS2 compared to ROS1, nor how expensive the unmarshalling is with Python under ROS2. Documentation would also help for the wifi issues, with known-to-work configuration examples etc.

If it is technically possible, can the basic performance tools, e.g. “ros2 topic hz”, be written such that they work as expected or print warnings when they are not reliable? Can they be written in C++ or fixed by some other means?

Under cyclonedds, I can send about 137 kHz of the messages below, under fastrtps about 130 kHz. No chance of scaling that to 50MHz.

uint16 x
uint16 y
uint64 ts
bool polarity
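
To put the gap in numbers, here is a small back-of-the-envelope sketch in Python. It only counts the raw field bytes of the message above, packed without padding; the size on the wire can be larger because of alignment and protocol headers, which are not modelled here. The variable names are purely illustrative.

```python
import struct

# Fields of the message above, packed little-endian without padding:
# uint16 x, uint16 y, uint64 ts, bool polarity
EVENT_FORMAT = "<HHQ?"
payload_bytes = struct.calcsize(EVENT_FORMAT)

target_rate_hz = 50e6      # rate asked about earlier in the thread
measured_rate_hz = 137e3   # Cyclone DDS figure quoted above

payload_rate_mb_s = payload_bytes * target_rate_hz / 1e6
shortfall = target_rate_hz / measured_rate_hz

print(f"{payload_bytes} bytes of fields per event")
print(f"{payload_rate_mb_s:.0f} MB/s of raw payload at 50 MHz")
print(f"target is about {shortfall:.0f}x the measured message rate")
```

The field bytes add up to 13, the same value that appears as "size 13" in the ddsperf output further down. The target is several hundred times the measured per-message rate, which is why sending one event per message does not look feasible here.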

Under cyclonedds, I can send about 137 kHz of the messages below, under fastrtps about 130 kHz. No chance of scaling that to 50MHz.

I can today squeeze 3MHz out of Cyclone (*) for that data type over the loopback interface by letting it batch samples in larger messages, something I am working on enabling properly so that it can also be used in ROS. There is definitely some room for improvement, so perhaps 5MHz could be done with some effort, but 50MHz is another factor of 10 … that’s not likely to ever happen.

With the Iceoryx integration I think it ought to be feasible but that, too, will take a bit of time.

So yes, I’m afraid you’ll have to accept using arrays …

(*) M1-based MBP, using Cyclone DDS master branch and with ddsperf hacked to support your data type:

# CYCLONEDDS_URI="<Gen><NetworkInterfaceAddress>lo0</><MaxMessageSize>65500B</></>"
# bin/ddsperf -TUKC sub & bin/ddsperf -TUKC pub & wait
[62678] 12.005 2.99M/s   0n |@=......................... ..|   0% 182u
[62678] 12.005  rss:6.1MB vcsw:607 ivcsw:1049 recvUC:10%+0% pub:71%+23%
[62677] 12.005  size 13 total 35811458 lost 0 delta 2992035 lost 0 rate 2992.00 kS/s 311.17 Mb/s (3000.37 kS/s 312.04 Mb/s)
[62677] 12.005  rss:3.5MB vcsw:18808 ivcsw:925 recvUC:76%+4%
[62678] 13.005    3M/s   0n |@=.........................  .|   0% 241u
[62678] 13.005  rss:6.1MB vcsw:611 ivcsw:884 recvUC:10%+0% pub:71%+23%
[62677] 13.005  size 13 total 38815968 lost 0 delta 3004510 lost 0 rate 3004.74 kS/s 312.49 Mb/s (3000.17 kS/s 312.02 Mb/s)
[62677] 13.005  rss:3.5MB vcsw:19886 ivcsw:805 recvUC:77%+4%

(20% of the publisher’s time is spent in sendmsg() :flushed: macOS could do with a better network stack …)
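
As a sanity check on how the columns in the ddsperf output relate to each other, the Mb/s figures can be reproduced from the reported sample size and sample rate. This is a minimal Python sketch using only numbers copied from the log above; the helper name megabits_per_second is hypothetical.

```python
SAMPLE_BYTES = 13  # "size 13" in the ddsperf output


def megabits_per_second(rate_ks_per_s, sample_bytes=SAMPLE_BYTES):
    """Convert a rate in kilo-samples per second to megabits per second."""
    return rate_ks_per_s * 1e3 * sample_bytes * 8 / 1e6


# The two per-second rates reported by the subscriber in the log.
for rate in (2992.00, 3004.74):
    print(f"{rate:.2f} kS/s -> {megabits_per_second(rate):.2f} Mb/s")
```

Both results agree with the 311.17 Mb/s and 312.49 Mb/s shown in the log, so the reported bandwidth counts sample payload only.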

Some closing remarks on the ROS2 performance issues I encountered when writing a ROS2 driver for the event-based Prophesee cameras.

The serialization/deserialization of structures in ROS2 is currently much slower than in ROS1, in particular if such structures contain other non-primitive data types such as Time. Getting high performance in ROS2 currently requires avoiding such situations. I ended up squeezing all event data into a uint64 field and then essentially packing/unpacking manually into that 64-bit space. As a fringe benefit, the compressed data format also cut the storage and bandwidth requirements by almost a factor of two.
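
The manual packing described above can be sketched as follows. This is only an illustration in Python: the actual bit layout used in the driver is not given in this thread, so the layout here (32-bit timestamp, 16-bit x, 15-bit y, 1 polarity bit) and the function names pack_event / unpack_event are assumptions. In particular, fitting the timestamp into 32 bits assumes it is stored relative to some base time rather than as the full uint64.

```python
# Assumed layout, least significant bit first:
#   bits  0..31  timestamp (relative to some base time)
#   bits 32..47  x
#   bits 48..62  y
#   bit  63      polarity
T_BITS, X_BITS, Y_BITS = 32, 16, 15

T_MASK = (1 << T_BITS) - 1
X_MASK = (1 << X_BITS) - 1
Y_MASK = (1 << Y_BITS) - 1

X_SHIFT = T_BITS
Y_SHIFT = X_SHIFT + X_BITS
P_SHIFT = Y_SHIFT + Y_BITS  # 63


def pack_event(x, y, t, polarity):
    """Pack one event into a single unsigned 64-bit integer."""
    if not (0 <= x <= X_MASK and 0 <= y <= Y_MASK and 0 <= t <= T_MASK):
        raise ValueError("field out of range for the assumed layout")
    return (int(bool(polarity)) << P_SHIFT) | (y << Y_SHIFT) | (x << X_SHIFT) | t


def unpack_event(word):
    """Inverse of pack_event: returns (x, y, t, polarity)."""
    t = word & T_MASK
    x = (word >> X_SHIFT) & X_MASK
    y = (word >> Y_SHIFT) & Y_MASK
    polarity = bool((word >> P_SHIFT) & 1)
    return x, y, t, polarity


word = pack_event(639, 479, 123456, True)
print(hex(word), unpack_event(word))
```

With a layout like this, a message can carry a single array of uint64 values, so the serializer only ever sees one primitive array.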

AFAIK Galactic currently does not support running rosbag record as a composable node. However, with a small temporary tweak to the Recorder code I was able to write my own composable node that can store data without inter-process communication.

With all the necessary optimizations in place (simple data types, composable nodes) ROS2 (cyclonedds) actually runs with slightly less CPU consumption than ROS1 as shown in this table (scroll all the way down). So it’s not like ROS2 can’t perform, it’s just that one has to work harder to get there.

I’m still curious though as to what the path forward is for fixing the serialization/deserialization performance issues. Can/will these be fixed by a more efficient rmw implementation?

The serialisation and deserialisation are up to the implementation of CDR provided/used by the underlying DDS implementation. I’m not sure there’s much that the RMW implementation can do about it, since it just passes what is effectively a binary blob (the message data) to the serialiser and gets back another binary blob (the serialised data).

@Bernd_Pfrommer could you create an issue in RMW for the slow (de)serialization? Even though @gbiggs thinks there’s not much RMW can improve, I think it’s important to have the issue there so that 1) people know about it, 2) TSC can start cooperating with the DDS vendors and look for solutions.

Please do, we love GitHub issues to keep improving the Eclipse Foundation contributed ROS middleware. And hit us on gitter IM: cyclonedds, iceoryx, Zenoh.

What do you think about enabling cyclonedds’ built-in iceoryx shared memory by default?

Our tests show it improves these measures: latency, jitter, memory, CPU, throughput. Matthias at Apex.AI is preparing a PR that makes it support dynamically sized messages for different QoS so users won’t need prior knowledge or configuration. Look here for “Cyclone DDS with iceoryx Improvement” for an idea of what enabling it by default would do. Note that latency has further improved since these tests were run because PR#348 has been merged.

Good idea. I had already created an issue here:

and have now asked for it to be kept open for documentation purposes and as an opportunity to improve it.
