I’m so embarrassed. I mis-read the README in your repo and thought the goal was 5,000 elements per message, but then I re-read it and saw 50,000, so a few more tricks are required:
replace the array of builtin_interfaces/Time with primitive arrays of secs and nsecs. Otherwise it has to call the serialization/deserialization functions for that type.
set the UDP socket buffers much larger. I set them all to 64 MB just for fun.
pre-allocate the message outside the main loop, rather than allocating a new MsgType every time (a rough sketch of this and the first step is below).
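For later readers, here is a minimal sketch of the first and third steps. The package, message type, and field names (my_msgs/msg/EventArray with sec/nsec/x/y arrays) are made up for illustration, not the actual type used above:

```cpp
#include <rclcpp/rclcpp.hpp>
// Hypothetical message type: my_msgs/msg/EventArray with primitive arrays
//   uint32[] sec, uint32[] nsec, uint16[] x, uint16[] y
// instead of an array of builtin_interfaces/Time.
#include "my_msgs/msg/event_array.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("event_publisher");
  auto pub = node->create_publisher<my_msgs::msg::EventArray>("events", 10);

  // Step 3: pre-allocate the message once, outside the publish loop,
  // so each iteration only overwrites existing storage.
  my_msgs::msg::EventArray msg;
  const size_t n = 50000;
  msg.sec.resize(n);
  msg.nsec.resize(n);
  msg.x.resize(n);
  msg.y.resize(n);

  // Step 2 (larger UDP socket buffers) is OS/DDS configuration rather than
  // application code, e.g. raising net.core.rmem_max / net.core.wmem_max and
  // the matching buffer settings of the DDS implementation.
  rclcpp::WallRate rate(1000.0);  // 1 kHz
  while (rclcpp::ok()) {
    // ... fill msg.sec / msg.nsec / msg.x / msg.y in place here ...
    pub->publish(msg);
    rate.sleep();
  }
  rclcpp::shutdown();
  return 0;
}
```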
With those steps, it sends fine (no dropped messages) at 50,000 elements on FastDDS and Cyclone, and on Cyclone it goes up through 150,000 elements per message (at 1 kHz) without any drops. That is just about maxing out a core for each (publisher and subscriber) on my machine, so if you actually do anything to the data, it would start dropping messages unless you’re careful with multi-threading. I didn’t spend time tuning FastDDS, but I’m sure it could also go well beyond 50k elements with some tuning effort.
In general though, I think this is probably close to the limit of what DDS (UDP) can do. I’d suggest using the composition feature of ROS 2 as well as the new shared-memory features of both FastDDS and Cyclone (iceoryx), some of which may need fixed-size messages in order to work nicely. It’s a lot of data to throw around.
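To spell out the composition suggestion for readers who haven’t used it: the idea is to build publisher and subscriber as components and load them into one container process so intra-process communication can be used. A rough sketch, where the class and package names are invented for illustration:

```cpp
#include <rclcpp/rclcpp.hpp>
#include <rclcpp_components/register_node_macro.hpp>
#include "std_msgs/msg/u_int64_multi_array.hpp"

// Hypothetical component: build it into a shared library and load it into a
// ComposableNodeContainer (or via `ros2 component load`) together with the
// subscriber, so both live in one process.
class EventSource : public rclcpp::Node
{
public:
  explicit EventSource(const rclcpp::NodeOptions & options)
  : rclcpp::Node("event_source", options)
  {
    pub_ = create_publisher<std_msgs::msg::UInt64MultiArray>("events", 10);
  }

private:
  rclcpp::Publisher<std_msgs::msg::UInt64MultiArray>::SharedPtr pub_;
};

RCLCPP_COMPONENTS_REGISTER_NODE(EventSource)
```

Whether intra-process communication is actually used is then controlled through the NodeOptions (use_intra_process_comms) when the components are loaded into the container.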
While it seems there are some approaches which mitigate / solve the immediate problem reported by @Bernd_Pfrommer, this observation is pretty spot-on:
I don’t have an easy fix, and I fully believe I understand why things are the way they are in ROS 2 – and I also understand that making high-performance networking applications is still complex and needs lots of moving parts to do just the right things in the right way – but seeing what @codebot had to do to get to the performance of @Bernd_Pfrommer’s ‘naive’ ROS 1 implementation makes me sad.
It connects to ROS2 Default Behavior (Wifi) and a few other threads we’ve had about the user experience of ROS 2 compared to ROS 1.
In this case a user actually reached out and asked what he could do better to improve things. In others, that never happens, and ROS 2 is discarded as not-ready, not performant enough or “even worse than ROS 1”.
(on the bright side: situations like this are consultancy heaven. People-in-the-know are in a very good position)
Edit: perhaps ROS2 speed - #19 by codebot and ROS2 speed - #21 by codebot should be turned into a guide in the ROS 2 documentation? Could be a good start for a performance-oriented set of patterns and/or a ROS 2 tuning guide.
It doesn’t seem like there is a lack of knowledge on these subjects. It just seems to be hidden in developer minds and collective experience in organisations.
Apart from what has already been proposed, how about publishing the events directly, instead of a variable-length array of them? How would the example behave at a rate of 50 MHz with 16-byte messages?
The link “ROS2 Default Behavior (Wifi)” was a very interesting one and I read many of the posts there. It was particularly interesting because, when filing the bug report against Cyclone DDS, I came across @allenh1’s bug report about poor wifi performance, so this is apparently still an ongoing issue. Granted, for wifi problems it’s impossible to provide a simple reproducible setup because the hardware involved is very non-standard and highly configurable. Still, the fact that experienced ROS/ROS2 engineers struggle with getting reasonable transmission rates over Wifi is troublesome.
Another part of the problem: the ROS community is just too nice. I deliberated for a while whether I should actually post this under a thread that’s called “ROS2 speed”, although my problem was very similar to the OP’s. Are you going to search for “speed” or “slow” or “performance issues” when you have issues? Probably the latter two. In a similar vein, if you were seeing terrible wifi performance under ROS2, would you think that’s the focus of a thread called “ROS2 Default Behavior (Wifi)”? Probably not. I don’t mean to criticize either of the posters for being civilized; in fact, they stand out for taking the time to raise the issue in the first place.
Along those lines, here is a suggestion: clearly document ROS2 performance issues vs ROS1 and point to possible workarounds. It will not be great advertising for ROS2, but it is still better than people learning about it the hard way. I did not know how expensive the marshalling of messages is in ROS2 compared to ROS1, nor how expensive the unmarshalling is with Python under ROS2. Documentation would also help for the wifi issues, with known-to-work configuration examples etc.
If possible from a technical point of view, can the basic performance tools like e.g. “ros2 topic hz” be written such that they work as expected, or print warnings when they are not reliable? Can they be written in C++ or fixed by some other means?
Under cyclonedds, I can send about 137 kHz of the messages below, under fastrtps about 130 kHz. No chance of scaling that to 50 MHz.
I can today squeeze 3 MHz out of Cyclone (*) for that data type over the loopback interface by letting it batch samples into larger messages, something I am working on enabling properly so that it can also be used in ROS. There is definitely some room for improvement, so perhaps 5 MHz could be done with some effort, but 50 MHz is another factor of 10 … that’s not likely to ever happen.
With the Iceoryx integration I think it ought to be feasible but that, too, will take a bit of time.
So yes, I’m afraid you’ll have to accept using arrays …
(*) M1-based MBP, using Cyclone DDS master branch and with ddsperf hacked to support your data type:
The serialization/deserialization of structures in ROS2 is currently much slower than in ROS1, in particular if such structures contain other non-primitive data types such as Time. Getting high performance in ROS2 currently requires avoiding such situations. I ended up squeezing all event data into a uint64 field and then essentially packing/unpacking manually into that 64-bit space. As a fringe benefit, the compressed data format also cut the storage and bandwidth requirements by almost a factor of two.
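For anyone wondering what that packing looks like in practice, here is a rough sketch; the bit layout below (16-bit x, 16-bit y, 1-bit polarity, 31-bit time offset) is invented for illustration and is not necessarily the layout actually used:

```cpp
#include <cstdint>

// Hypothetical layout: 16 bits x | 16 bits y | 1 bit polarity | 31 bits dt(ns)
// packed into a single uint64 array element instead of a struct that contains
// a builtin_interfaces/Time member.
inline uint64_t pack_event(uint16_t x, uint16_t y, bool polarity, uint32_t dt_ns)
{
  return (static_cast<uint64_t>(x) << 48) |
         (static_cast<uint64_t>(y) << 32) |
         (static_cast<uint64_t>(polarity) << 31) |
         (static_cast<uint64_t>(dt_ns) & 0x7FFFFFFFu);
}

inline void unpack_event(uint64_t e, uint16_t & x, uint16_t & y,
                         bool & polarity, uint32_t & dt_ns)
{
  x = static_cast<uint16_t>(e >> 48);
  y = static_cast<uint16_t>(e >> 32);
  polarity = ((e >> 31) & 1u) != 0;
  dt_ns = static_cast<uint32_t>(e & 0x7FFFFFFFu);
}
```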
AFAIK Galactic currently does not support running rosbag record as a composable node. However, with a small temporary tweak to the Recorder code I was able to write my own composable node that can record without inter-process communication.
With all the necessary optimizations in place (simple data types, composable nodes) ROS2 (cyclonedds) actually runs with slightly less CPU consumption than ROS1 as shown in this table (scroll all the way down). So it’s not like ROS2 can’t perform, it’s just that one has to work harder to get there.
I’m still curious though as to what the path forward is for fixing the serialization/deserialization performance issues. Can/will these be fixed by a more efficient rmw implementation?
The serialisation and deserialisation are up to the implementation of CDR provided/used by the underlying DDS implementation. I’m not sure there’s much that the RMW implementation can do about it, since it just passes what is effectively a binary blob (the message data) to the serialiser and gets back another binary blob (the serialised data).
@Bernd_Pfrommer could you create an issue in RMW for the slow (de)serialization? Even though @gbiggs thinks there’s not much RMW can improve, I think it’s important to have the issue there so that 1) people know about it, 2) TSC can start cooperating with the DDS vendors and look for solutions.
What do you think about enabling cyclonedds’ built-in iceoryx shared memory by default?
Our tests show it improves these measures: latency, jitter, memory, CPU, and throughput. Matthias at Apex.AI is preparing a PR that makes it support dynamically sized messages for different QoS settings, so users won’t need prior knowledge or configuration. Here, look for “Cyclone DDS with iceoryx Improvement” for an idea of what enabling it by default would do. Note that latency has further improved since these tests were run because PR#348 has been merged.
I am really sorry to dig up this post, but I am wondering: what improvements have been made here in the past 3 years?
Has that issue been fully solved by now, or at least improved?
@Patrick, the short answer is that progress has been marginal.
I just reran the numbers, albeit on faster hardware (Intel i7-14700K 64G RAM). All measurements are done intra-node, so no network involved. I picked an array size of 50000 elements, and cranked up the publisher frequency until ROS1 started to drop messages due to rostopic bw running out of CPU resources.
All frequencies are in Hz and given as requested/without subscriber/with subscriber
Publisher CPU% is given with the subscriber running.
(Note: after a short while the subscriber cannot keep up (CPU goes to 100%) and the receive frequency drops to 3600; not sure what’s happening there, probably something at the OS level)
The bottom line is that the marshaling of non-primitive data types is still slow under ROS2. Zenoh outperforms FastRTPS on the subscriber side, but it looks like fundamentally they have the same bottleneck at the publisher.
Note that ros2 topic hz will report 10 Hz and run at 100% CPU. Apparently fixing this is somewhat more involved, see this issue.
Couldn’t agree more. I’m not using any of the new features like QoS and lifecycle nodes etc., but I’d like to at least recover the qualities of ROS1 with respect to ease of use and performance.
The rmw_zenoh_cpp package is coming along. The support is good, issues get addressed very quickly, and performance-wise it does better than the existing RMWs, see above (I did not tune any DDS or Zenoh parameters, all is default). I can also get it configured without pulling my hair out. I’ve been using it exclusively for the last 9 months and can only recommend it. The more adoption, the quicker we get an RMW that is of ROS1 quality.
But what about the message format? Does the marshaling have to be so slow, or can it be sped up? The layout/size of the data members is known at compile time, so why would marshaling be so slow? One would at most expect some host-to-net conversion (little/big endian) to be necessary. Implementation-wise, however, this could be a very difficult change to make that may touch much of the core ROS2 code. Maybe some of the core ROS2 engineers can comment on that?
Concur completely. We’re a multinational ROS power user, working with hundreds of nodes and thousands of lines of ROS code. We don’t need or use any of the QoS, configuration, or headaches provided by ROS2. We just want equivalent ROS1 ease of use and performance. We want to send a message and have it get there, so we can debug the application and not the middleware.
The foundation of serialisation and deserialisation in ROS 2 is provided by the RMW implementation: each RMW uses whatever is appropriate for that RMW. For example, the FastDDS RMW uses the CDR implementation provided by eProsima, and the Cyclone DDS one uses the CDR implementation provided by the Eclipse Cyclone DDS project. Serialisation speed is therefore fundamentally defined by the speed of the CDR implementation being used.
However, a key aspect of ROS 2’s serialisation is that it goes through the type support system. This consists of the generated serialisation and deserialisation functions that are created from templates when a message definition is compiled.
If there is a slowdown due to serialisation, it may be in either of these places (or both). Improving the speed of serialisation therefore requires someone to either:
profile the CDR implementation they are using, and find places to optimise it, or
profile the generated type support functions and find ways to generate faster code, without causing generation and compilation time to take too much longer than it already does, and without causing memory use of the generated code to rise too much.
Neither is easy, but we would love to see contributions in this area. Improving the speed of ROS isn’t flashy like a major new feature, but it benefits everyone, including the people who make that contribution. The second item is the harder one due to the flexibility of the type support system and the range of types it has to handle.
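As a starting point for either kind of profiling, the per-message cost can be measured in isolation with rclcpp::Serialization, which goes through the generated type support into the CDR implementation of whichever RMW is loaded. A rough sketch; the message type, payload size, and iteration count are arbitrary:

```cpp
#include <chrono>
#include <iostream>

#include <rclcpp/rclcpp.hpp>
#include <rclcpp/serialization.hpp>
#include "sensor_msgs/msg/point_cloud2.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);

  // Arbitrary message type and payload size, roughly matching the
  // 50,000-element scale discussed in this thread.
  sensor_msgs::msg::PointCloud2 msg;
  msg.data.resize(50000 * 16);

  // serialize_message() runs the generated type support, which in turn uses
  // the CDR implementation of the currently loaded RMW.
  rclcpp::Serialization<sensor_msgs::msg::PointCloud2> ser;
  rclcpp::SerializedMessage out;

  const int iterations = 1000;
  const auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i) {
    ser.serialize_message(&msg, &out);
  }
  const auto t1 = std::chrono::steady_clock::now();
  std::cout << "avg serialize: "
            << std::chrono::duration<double, std::micro>(t1 - t0).count() / iterations
            << " us" << std::endl;

  rclcpp::shutdown();
  return 0;
}
```

Comparing the same loop across RMWs (and against a plain memcpy of the same payload) gives a quick feel for how much of the publish cost is serialisation versus transport.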