I’m so embarrassed. I mis-read the README in your repo and thought the goal was 5,000 elements per message, but then I re-read it and saw 50,000, so a few more tricks are required:
replace the array of builtin_interfaces/Time with primitive arrays of secs and nsecs. Otherwise it has to call the serialization/deserialization functions for that type.
set the UDP socket buffers much larger. I set them all to 64 MB just for fun.
pre-allocate the message outside the main loop, rather than allocating a new MsgType every time (a rough sketch of this and the first step is below).
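For later readers, here is a minimal sketch of the first and third steps. The package, message type, and field names (my_msgs/msg/EventArray with sec/nsec/x/y arrays) are made up for illustration, not the actual type used above:

```cpp
#include <rclcpp/rclcpp.hpp>
// Hypothetical message type: my_msgs/msg/EventArray with primitive arrays
//   uint32[] sec, uint32[] nsec, uint16[] x, uint16[] y
// instead of an array of builtin_interfaces/Time.
#include "my_msgs/msg/event_array.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("event_publisher");
  auto pub = node->create_publisher<my_msgs::msg::EventArray>("events", 10);

  // Step 3: pre-allocate the message once, outside the publish loop,
  // so each iteration only overwrites existing storage.
  my_msgs::msg::EventArray msg;
  const size_t n = 50000;
  msg.sec.resize(n);
  msg.nsec.resize(n);
  msg.x.resize(n);
  msg.y.resize(n);

  // Step 2 (larger UDP socket buffers) is OS/DDS configuration rather than
  // application code, e.g. raising net.core.rmem_max / net.core.wmem_max and
  // the matching buffer settings of the DDS implementation.
  rclcpp::WallRate rate(1000.0);  // 1 kHz
  while (rclcpp::ok()) {
    // ... fill msg.sec / msg.nsec / msg.x / msg.y in place here ...
    pub->publish(msg);
    rate.sleep();
  }
  rclcpp::shutdown();
  return 0;
}
```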
With those steps, it sends fine (no dropped messages) at 50,000 elements on FastDDS and Cyclone, and on Cyclone it goes up through 150,000 elements per message (at 1 kHz) without any drops. That is just about maxing out a core for each (publisher and subscriber) on my machine, so if you actually do anything to the data, it would start dropping messages unless you’re careful with multi-threading. I didn’t spend time tuning FastDDS, but I’m sure it could also go well beyond 50k elements with some tuning effort.
In general though, I think this is probably close to the limit of what DDS (UDP) can do. I’d suggest using the composition feature of ROS 2 as well as the new shared-memory features of both FastDDS and Cyclone (iceoryx), some of which may need fixed-size messages in order to work nicely. It’s a lot of data to throw around.
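To spell out the composition suggestion for readers who haven’t used it: the idea is to build publisher and subscriber as components and load them into one container process so intra-process communication can be used. A rough sketch, where the class and package names are invented for illustration:

```cpp
#include <rclcpp/rclcpp.hpp>
#include <rclcpp_components/register_node_macro.hpp>
#include "std_msgs/msg/u_int64_multi_array.hpp"

// Hypothetical component: build it into a shared library and load it into a
// ComposableNodeContainer (or via `ros2 component load`) together with the
// subscriber, so both live in one process.
class EventSource : public rclcpp::Node
{
public:
  explicit EventSource(const rclcpp::NodeOptions & options)
  : rclcpp::Node("event_source", options)
  {
    pub_ = create_publisher<std_msgs::msg::UInt64MultiArray>("events", 10);
  }

private:
  rclcpp::Publisher<std_msgs::msg::UInt64MultiArray>::SharedPtr pub_;
};

RCLCPP_COMPONENTS_REGISTER_NODE(EventSource)
```

Whether intra-process communication is actually used is then controlled through the NodeOptions (use_intra_process_comms) when the components are loaded into the container.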
While it seems there are some approaches which mitigate / solve the immediate problem reported by @Bernd_Pfrommer, this observation is pretty spot-on:
I don’t have an easy fix, and I fully believe I understand why things are the way they are in ROS 2 – and I also understand that making high-performance networking applications is still complex and needs lots of moving parts to do just the right things in the right way – but seeing what @codebot had to do to get to the performance of @Bernd_Pfrommer’s ‘naive’ ROS 1 implementation makes me sad.
It connects to ROS2 Default Behavior (Wifi) and a few other threads we’ve had about the user experience of ROS 2 compared to ROS 1.
In this case a user actually reached out and asked what he could do better to improve things. In others, that never happens, and ROS 2 is discarded as not-ready, not performant enough or “even worse than ROS 1”.
(on the bright side: situations like this are consultancy heaven. People-in-the-know are in a very good position)
Edit: perhaps ROS2 speed - #19 by codebot and ROS2 speed - #21 by codebot should be turned into a guide in the ROS 2 documentation? Could be a good start for a performance-oriented set of patterns and/or a ROS 2 tuning guide.
It doesn’t seem like there is a lack of knowledge on these subjects. It just seems to be hidden in developer minds and collective experience in organisations.
Apart from what has already been proposed, how about publishing the events directly, instead of a variable-length array of them? How would the example behave at a rate of 50 MHz with 16-byte messages?
The link “ROS2 Default Behavior (Wifi)” was a very interesting one and I read many of the posts there. It was particularly interesting because, when filing the bug report against Cyclone DDS, I came across @allenh1’s bug report about poor wifi performance, so this is apparently still an ongoing issue. Granted, for wifi problems it’s impossible to provide a simple reproducible setup because the hardware involved is very non-standard and highly configurable. Still, the fact that experienced ROS/ROS2 engineers struggle with getting reasonable transmission rates over Wifi is troublesome.
Another part of the problem: the ROS community is just too nice. I deliberated for a while whether I should actually post this under a thread that’s called “ROS2 speed”, although my problem was very similar to the OP’s. Are you going to search for “speed” or “slow” or “performance issues” when you have issues? Probably the latter two. In a similar vein, if you were seeing terrible wifi performance under ROS2, would you think that’s the focus of a thread called “ROS2 Default Behavior (Wifi)”? Probably not. I don’t mean to criticize either of the posters for being civilized; in fact, they stand out for taking the time to raise the issue in the first place.
Along those lines, here is a suggestion: clearly document ROS2 performance issues vs ROS1 and point to possible workarounds. It will not be great advertising for ROS2, but it is still better than people learning about it the hard way. I did not know how expensive the marshalling of messages is in ROS2 compared to ROS1, nor how expensive the unmarshalling is with Python under ROS2. Documentation would also help for the wifi issues, with known-to-work configuration examples etc.
If possible from a technical point of view, can the basic performance tools like e.g. “ros2 topic hz” be written such that they work as expected, or print warnings when they are not reliable? Can they be written in C++ or fixed by some other means?
Under cyclonedds, I can send about 137 kHz of the messages below, under fastrtps about 130 kHz. No chance of scaling that to 50 MHz.
I can today squeeze 3 MHz out of Cyclone (*) for that data type over the loopback interface by letting it batch samples into larger messages, something I am working on enabling properly so that it can also be used in ROS. There is definitely some room for improvement, so perhaps 5 MHz could be done with some effort, but 50 MHz is another factor of 10 … that’s not likely to ever happen.
With the Iceoryx integration I think it ought to be feasible but that, too, will take a bit of time.
So yes, I’m afraid you’ll have to accept using arrays …
(*) M1-based MBP, using Cyclone DDS master branch and with ddsperf hacked to support your data type:
The serialization/deserialization of structures in ROS2 is currently much slower than in ROS1, in particular if such structures contain other non-primitive data types such as Time. Getting high performance in ROS2 currently requires avoiding such situations. I ended up squeezing all event data into a uint64 field and then essentially packing/unpacking manually into that 64-bit space. As a fringe benefit, the compressed data format also cut the storage and bandwidth requirements by almost a factor of two.
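For anyone wondering what that packing looks like in practice, here is a rough sketch; the bit layout below (16-bit x, 16-bit y, 1-bit polarity, 31-bit time offset) is invented for illustration and is not necessarily the layout actually used:

```cpp
#include <cstdint>

// Hypothetical layout: 16 bits x | 16 bits y | 1 bit polarity | 31 bits dt(ns)
// packed into a single uint64 array element instead of a struct that contains
// a builtin_interfaces/Time member.
inline uint64_t pack_event(uint16_t x, uint16_t y, bool polarity, uint32_t dt_ns)
{
  return (static_cast<uint64_t>(x) << 48) |
         (static_cast<uint64_t>(y) << 32) |
         (static_cast<uint64_t>(polarity) << 31) |
         (static_cast<uint64_t>(dt_ns) & 0x7FFFFFFFu);
}

inline void unpack_event(uint64_t e, uint16_t & x, uint16_t & y,
                         bool & polarity, uint32_t & dt_ns)
{
  x = static_cast<uint16_t>(e >> 48);
  y = static_cast<uint16_t>(e >> 32);
  polarity = ((e >> 31) & 1u) != 0;
  dt_ns = static_cast<uint32_t>(e & 0x7FFFFFFFu);
}
```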
AFAIK Galactic currently does not support running rosbag record as a composable node. However, with a small temporary tweak to the Recorder code I was able to write my own composable node that can record without inter-process communication.
With all the necessary optimizations in place (simple data types, composable nodes) ROS2 (cyclonedds) actually runs with slightly less CPU consumption than ROS1 as shown in this table (scroll all the way down). So it’s not like ROS2 can’t perform, it’s just that one has to work harder to get there.
I’m still curious though as to what the path forward is for fixing the serialization/deserialization performance issues. Can/will these be fixed by a more efficient rmw implementation?
The serialisation and deserialisation are up to the implementation of CDR provided/used by the underlying DDS implementation. I’m not sure there’s much that the RMW implementation can do about it, since it just passes what is effectively a binary blob (the message data) to the serialiser and gets back another binary blob (the serialised data).
@Bernd_Pfrommer could you create an issue in RMW for the slow (de)serialization? Even though @gbiggs thinks there’s not much RMW can improve, I think it’s important to have the issue there so that 1) people know about it, 2) TSC can start cooperating with the DDS vendors and look for solutions.
What do you think about enabling cyclonedds’ built-in iceoryx shared memory by default?
Our tests show it improves these measures: latency, jitter, memory, CPU, and throughput. Matthias at Apex.AI is preparing a PR that makes it support dynamically sized messages for different QoS settings, so users won’t need prior knowledge or configuration. Here, look for “Cyclone DDS with iceoryx Improvement” for an idea of what enabling it by default would do. Note that latency has further improved since these tests were run because PR#348 has been merged.
I am really sorry to dig up this post, but I am wondering: what improvements have been made here in the past 3 years?
Has that issue been fully solved by now, or at least improved?
@Patrick, the short answer is that progress has been marginal.
I just reran the numbers, albeit on faster hardware (Intel i7-14700K 64G RAM). All measurements are done intra-node, so no network involved. I picked an array size of 50000 elements, and cranked up the publisher frequency until ROS1 started to drop messages due to rostopic bw running out of CPU resources.
All frequencies are in Hz and given as requested/without subscriber/with subscriber
Publisher CPU% is given with the subscriber running.
(Note: after a short while the subscriber cannot keep up (CPU goes to 100%) and the receive frequency drops to 3600; not sure what’s happening there, probably something at the OS level)
The bottom line is that the marshaling of non-primitive data types is still slow under ROS2. Zenoh outperforms FastRTPS on the subscriber side, but it looks like fundamentally they have the same bottleneck at the publisher.
Note that ros2 topic hz will report 10 Hz and run at 100% CPU. Apparently fixing this is somewhat more involved, see this issue.
Couldn’t agree more. I’m not using any of the new features like QoS and lifecycle nodes etc., but I’d like to at least recover the qualities of ROS1 with respect to ease of use and performance.
The rmw_zenoh_cpp package is coming along. The support is good, issues get addressed very quickly, and performance-wise it does better than the existing RMWs, see above (I did not tune any DDS or Zenoh parameters, all is default). I can also get it configured without pulling my hair out. I’ve been using it exclusively for the last 9 months and can only recommend it. The more adoption, the quicker we get an RMW that is of ROS1 quality.
But what about the message format? Does the marshaling have to be so slow, or can it be sped up? The layout/size of the data members is known at compile time, so why would marshaling be so slow? One would at most expect some host-to-net conversion (little/big endian) to be necessary. Implementation-wise, however, this could be a very difficult change to make that may touch much of the core ROS2 code. Maybe some of the core ROS2 engineers can comment on that?
Concur completely. We’re a multinational ROS power user, working with hundreds of nodes and thousands of lines of ROS code. We don’t need or use any of the QoS, configuration, or headaches provided by ROS2. We just want equivalent ROS1 ease of use and performance. We want to send a message and have it get there, so we can debug the application and not the middleware.
The foundation of serialisation and deserialisation in ROS 2 is provided by the RMW implementation: each RMW uses whatever is appropriate for that RMW. For example, the FastDDS RMW uses the CDR implementation provided by eProsima, and the Cyclone DDS one uses the CDR implementation provided by the Eclipse Cyclone DDS project. Serialisation speed is therefore fundamentally defined by the speed of the CDR implementation being used.
However, a key aspect of ROS 2’s serialisation is that it goes through the type support system. This consists of the generated serialisation and deserialisation functions that are created from templates when a message definition is compiled.
If there is a slowdown due to serialisation, it may be in either of these places (or both). Improving the speed of serialisation therefore requires someone to either:
profile the CDR implementation they are using, and find places to optimise it, or
profile the generated type support functions and find ways to generate faster code, without causing generation and compilation time to take too much longer than it already does, and without causing memory use of the generated code to rise too much.
Neither is easy, but we would love to see contributions in this area. Improving the speed of ROS isn’t flashy like a major new feature, but it benefits everyone, including the people who make that contribution. The second item is the harder one due to the flexibility of the type support system and the range of types it has to handle.
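As a starting point for either kind of profiling, the per-message cost can be measured in isolation with rclcpp::Serialization, which goes through the generated type support into the CDR implementation of whichever RMW is loaded. A rough sketch; the message type, payload size, and iteration count are arbitrary:

```cpp
#include <chrono>
#include <iostream>

#include <rclcpp/rclcpp.hpp>
#include <rclcpp/serialization.hpp>
#include "sensor_msgs/msg/point_cloud2.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);

  // Arbitrary message type and payload size, roughly matching the
  // 50,000-element scale discussed in this thread.
  sensor_msgs::msg::PointCloud2 msg;
  msg.data.resize(50000 * 16);

  // serialize_message() runs the generated type support, which in turn uses
  // the CDR implementation of the currently loaded RMW.
  rclcpp::Serialization<sensor_msgs::msg::PointCloud2> ser;
  rclcpp::SerializedMessage out;

  const int iterations = 1000;
  const auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i) {
    ser.serialize_message(&msg, &out);
  }
  const auto t1 = std::chrono::steady_clock::now();
  std::cout << "avg serialize: "
            << std::chrono::duration<double, std::micro>(t1 - t0).count() / iterations
            << " us" << std::endl;

  rclcpp::shutdown();
  return 0;
}
```

Comparing the same loop across RMWs (and against a plain memcpy of the same payload) gives a quick feel for how much of the publish cost is serialisation versus transport.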