ROS2 speed

Hi guys,

I’ve been playing around with ROS2 quite a lot recently and I have to say that it’s amazing. I have an application running on ROS1 Noetic, which requires some of the nodes to publish data at a relatively high frequency (700 Hz). Everything works fine.

I wanted to port this software to ROS2, but before I did that, I simply tested the maximum frequency of simple publishers, which only publish the “Hello world” message in the String std message. What I’ve noticed is that with the ROS1 publisher I can publish this data at a maximum frequency of 9000 Hz, but when I test the same publisher with ROS2, the frequency gets no higher than 1000 Hz. This concerns me, because I really need the aforementioned nodes to publish data really fast.

The question is: Is ROS2 slower than ROS1 or is there something I’m missing?

Thank you guys for your answers in advance

7 Likes

Thanks for posting and sharing your experience.

Sorry for answering questions with questions, but may I ask about the platform details such as CPU, OS, ROS distribution and so on? I am also curious what kind of actual sensor data you are dealing with at 9 kHz, if I may.

I have experienced the same thing over the past months and asked myself the same question. I don’t think it is related to computer performance or CPU, because I ran the experiments on the same computer.

@lnotspotl @bekir_bostanci Are you using SingleThreadedExecutor in your tests?

A minimal reproducer would help with further analysis of the issue. Would you mind sharing it with the community?

1 Like

@klaxalk did some tests some time ago on Foxy and came to the same conclusion even for 100 Hz (generated by a ROS 2 timer): his conclusion is “slow and irregular publishing”. Here’s the code: ros2_examples/publisher_example.cpp at master · ctu-mrs/ros2_examples · GitHub.

Hello,
Could you guys prepare a minimal synthetic sample for both ROS1 and ROS2, runnable on Ubuntu, where I could observe this, so that I can benchmark it and look for the commit to blame?

Best regards,
Pawel Kunio

On Sat, 1 May 2021 at 10:04, Michael Orlov via ROS Discourse <ros@discoursemail.com> wrote:

3 Likes

Hi,
Thanks. I’ll try to replicate it and bisect for the offending commit.

Best regards,
Pawel Kunio

On Sat, 1 May 2021 at 10:39, Martin Pecka via ROS Discourse <ros@discoursemail.com> wrote:

I don’t think this thread is about some particular commit that slowed down ROS2. It seems to me it is saying that maybe there has been some inherent slowness from the very beginning. You can try bisecting rcl, but don’t be surprised if it never gets better… (but maybe I’m wrong :) ).

I think we are bound to take another, deeper look into performance improvements of ros2. There is some promising work https://github.com/ros2/design/pull/305 on executors, but perhaps a dedicated, cross-repo performance WG would be a good idea. What do you think?

2 Likes

I think somebody above was claiming it worked better in ROS1. If rcl was introduced with the first version of ROS2, you may be right. I’ll try to tinker with it and let’s see what I find out.

On Sat, 1 May 2021 at 12:56, Martin Pecka via ROS Discourse <ros@discoursemail.com> wrote:

I didn’t use SingleThreadedExecutor. In my case I ran the code below, which comes from a ROS2 tutorial. However, I think someone capped the maximum frequency at 1000 Hz, because if you publish messages in an open loop, the maximum frequency always appears to be 1000 Hz and does not change.

#include <chrono>
#include <functional>
#include <string>

#include "minimal_composition/publisher_node.hpp"
#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/string.hpp"

using namespace std::chrono_literals;

PublisherNode::PublisherNode(rclcpp::NodeOptions options)
: Node("publisher_node", options), count_(0)
{
  publisher_ = create_publisher<std_msgs::msg::String>("topic", 10);
  // Fire on_timer() every 500 ms, i.e. publish at 2 Hz.
  timer_ = create_wall_timer(
    500ms, std::bind(&PublisherNode::on_timer, this));
}

void PublisherNode::on_timer()
{
  auto message = std_msgs::msg::String();
  message.data = "Hello, world! " + std::to_string(count_++);
  RCLCPP_INFO(this->get_logger(), "Publisher: '%s'", message.data.c_str());
  publisher_->publish(message);
}

#include "rclcpp_components/register_node_macro.hpp"

// Register the node as a component so it can be loaded into a container.
RCLCPP_COMPONENTS_REGISTER_NODE(PublisherNode)
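
Note that the listing above uses a 500 ms timer period, so as posted it publishes at 2 Hz. When testing higher rates it may help to derive the period from the target frequency instead of hard-coding it. A minimal sketch, using only the standard library; `period_for_rate` and `rate_for_period` are hypothetical helper names, not part of rclcpp:

```cpp
#include <chrono>
#include <cmath>
#include <stdexcept>

// Hypothetical helper: convert a target publishing rate in Hz into a
// timer period, rounded to the nearest nanosecond.
inline std::chrono::nanoseconds period_for_rate(double hz)
{
  if (!(hz > 0.0)) {
    throw std::invalid_argument("rate must be positive");
  }
  return std::chrono::nanoseconds(std::llround(1e9 / hz));
}

// Inverse: the publishing rate implied by a timer period.
inline double rate_for_period(std::chrono::nanoseconds period)
{
  return 1e9 / static_cast<double>(period.count());
}
```

For the 700 Hz case from the first post this gives a period of about 1.43 ms. Assuming `create_wall_timer` accepts any `std::chrono` duration, the result could be passed in place of the `500ms` literal.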

I will point out that the executors actually aren’t involved during the publishing of data; a call to publish goes straight through the rclcpp layer down to the DDS layer, and then out to the network.

However, if subscriptions are used to measure the rate (using ros2 topic hz or something similar), then that obviously does involve the executors.

What would be really interesting to see from someone is what the rate is on the publisher by doing measurements inside the publisher code. That will allow us to bisect the problem on either the publisher or subscription side.

The other thing to try here is different RMW implementations, and see if there is any difference between, say, Fast-RTPS and CycloneDDS.

5 Likes

Is this actually blocking until the message leaves the DDS publisher? Doesn’t it just store the message in some kind of publish queue?

It depends on the RMW implementation and how it is configured. Fast-RTPS supports both asynchronous mode (where the publication is queued and a background thread sends it out), and synchronous mode (where the publish call blocks until the data is actually sent on the network). CycloneDDS only supports synchronous mode.

By default, in Foxy and earlier, we use Fast-RTPS in asynchronous mode. By default, in Galactic, we use CycloneDDS in synchronous mode. But all three combinations (Fast-RTPS asynchronous, Fast-RTPS synchronous, CycloneDDS synchronous) are available to the user with non-default configurations.
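
To make the difference between the two publication modes concrete, here is a deliberately simplified, single-threaded model. It is not how any RMW is implemented: `Sink` stands in for "put the bytes on the wire", and `drain()` stands in for the background sender thread. All names are hypothetical.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Stand-in for "put the bytes on the wire"; purely illustrative.
using Sink = std::function<void(const std::string &)>;

// Synchronous mode: publish() returns only after the sink has run.
class SyncPublisher
{
public:
  explicit SyncPublisher(Sink sink) : sink_(std::move(sink)) {}
  void publish(const std::string & msg) { sink_(msg); }

private:
  Sink sink_;
};

// Asynchronous mode: publish() only enqueues the message.
class AsyncPublisher
{
public:
  explicit AsyncPublisher(Sink sink) : sink_(std::move(sink)) {}

  void publish(std::string msg) { queue_.push_back(std::move(msg)); }

  std::size_t pending() const { return queue_.size(); }

  // Stand-in for the background sender thread: sends everything queued
  // so far and returns how many messages went out.
  std::size_t drain()
  {
    std::size_t sent = 0;
    while (!queue_.empty()) {
      sink_(queue_.front());
      queue_.pop_front();
      ++sent;
    }
    return sent;
  }

private:
  Sink sink_;
  std::deque<std::string> queue_;
};
```

In this model the time spent inside `publish()` says very different things in the two modes, which is one reason a publish-call timing is hard to compare across RMW configurations.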

3 Likes

Hi everyone,

I’m leaving a couple of links on how to set the publication mode when using rmw_fastrtps_cpp (default up to ROS 2 Foxy):

Also, I want to make a couple of remarks:

  • As @clalancette points out, if what you want to measure is publication performance, then the measurements should be taken in the publishing code. That being said, it’s important to point out that if there are no data recipients, the middleware does close to nothing, so to get a valid measurement you’d need someone listening on the other side.
  • The out-of-the-box configuration may not be the best option for your case. There is no one-size-fits-all configuration, so ROS 2 ships a good compromise for most use cases. However, DDS offers many more possibilities that may be used to tune for specific cases such as yours. Synchronous publishing is merely one of them.

Hi All, Just wanted to add some of my own findings.

I am currently working with a CAN driver (https://github.com/astuff/kvaser_interface), which tends to publish at very high frequencies (1000 Hz+) since each message is a single CAN frame/message.

When I run the ROS1 driver attached to a radar sensor (which continuously sends CAN data), I get a very clean, consistent 1840 Hz when monitoring with rostopic hz /can_topic.
However, when I run the ROS2 driver (which is very similar in general implementation), I get a very inconsistent frequency when monitoring with ros2 topic hz /can_topic. The frequency will vary from 1300 to 1600, but never reach the true 1840.

However! When I run the ROS2 driver along with the ros1_bridge, and then monitor the ROS1 bridge output with rostopic hz /can_topic, I get a very clean, consistent 1840 Hz, while at the same time on the ROS2 side the frequency still appears to vary significantly.

This raises a few questions:

  1. How is everyone measuring the topic frequency? I assume with the same default rostopic hz and ros2 topic hz command-line tools? Just making sure we are comparing apples to apples.
  2. Do these findings suggest that there is possibly no inherent issue with ROS2 pub/subs or the underlying DDS implementations? Perhaps the ros2 topic hz tool is simply buggy or misleading? The ROS1 bridge is clearly receiving all the ROS2 data correctly, since it forwards it into ROS1 at the correct frequency.
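
One way to take the command-line tools out of the comparison is to log receive timestamps in a plain subscriber on each side and compute the statistics yourself. A minimal sketch, assuming the timestamps are collected in seconds and in arrival order; `HzStats` and `hz_stats` are hypothetical names:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct HzStats
{
  double rate_hz = 0.0;     // 1 / mean inter-arrival time
  double min_dt = 0.0;      // shortest gap between messages, seconds
  double max_dt = 0.0;      // longest gap between messages, seconds
  double std_dev_dt = 0.0;  // standard deviation of the gaps, seconds
};

// stamps: receive timestamps in seconds, in arrival order.
inline HzStats hz_stats(const std::vector<double> & stamps)
{
  HzStats s;
  if (stamps.size() < 2) {
    return s;  // not enough data for a single interval
  }
  const double n = static_cast<double>(stamps.size() - 1);
  double sum = 0.0;
  s.min_dt = s.max_dt = stamps[1] - stamps[0];
  for (std::size_t i = 1; i < stamps.size(); ++i) {
    const double dt = stamps[i] - stamps[i - 1];
    sum += dt;
    s.min_dt = std::min(s.min_dt, dt);
    s.max_dt = std::max(s.max_dt, dt);
  }
  const double mean = sum / n;
  double squares = 0.0;
  for (std::size_t i = 1; i < stamps.size(); ++i) {
    const double d = (stamps[i] - stamps[i - 1]) - mean;
    squares += d * d;
  }
  s.std_dev_dt = std::sqrt(squares / n);
  s.rate_hz = mean > 0.0 ? 1.0 / mean : 0.0;
  return s;
}
```

A single dropped or late message shows up as a large `max_dt` and a lower `rate_hz`, so a slow measuring tool and a slow transport can look identical from these numbers alone.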

Happy to hear what others think, and if anyone else can reproduce the same behavior, using the ros1_bridge as a way to test.

I am running Ubuntu 20.04 with Noetic and Foxy, with default DDS settings.

2 Likes

Hi, for measuring the local (inter-process) communication speed I set up a simple ROS2 publish/subscribe application. You can set the message size, the number of runs and the frequency (just as a delay between two publications), and at the end you get the average latency. It helped me a lot to get a rough estimate of how a specific ROS distro performs on different OS/HW versions. Not perfect, but I think a good “real world performance estimation”.
Please check https://github.com/rex-schilasky/ros2_latency_ipc and feel free to contribute :)
Results for galactic (windows 10 and ubuntu 20.04) can be found here.

2 Likes

I’m currently developing a ROS2 driver for an event-based camera. The sensor publishes data at very high rates (1 kHz) and in fairly large messages (800 kB).
Wrote the driver in ROS1: works like a charm.
Wrote it for ROS2 (Galactic on Ubuntu): terrible performance.
This is not my first bout with ROS2. I’ve gotten a bloody nose when trying to write ROS2 drivers for other sensors as well. RMW performance was terrible (back then under Eloquent) and I couldn’t figure out what was going on.
Before completely throwing in the towel with respect to ROS2, I decided to file a proper bug report against the rmw_cyclonedds.
I have also created a very small repo that demonstrates the issues I’m experiencing (the code has more details, it’s less than 100 lines).

So I’m trying to send custom messages (not something unusual for a fairly exotic sensor) which have a variable length array of a custom type struct (Event.msg), which looks like this:

uint16 x
uint16 y
builtin_interfaces/Time ts
bool polarity

Each Event has (ballpark) 2 + 2 + 8 + 1 = 13 bytes, or about 16 bytes with padding. The sensor delivers about 50 million events per second, so if we bunch them up in packets of 50,000 (ideally we’d be using even smaller ones), that means sending 800 kB messages at 1000 Hz.
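
The size estimate can be checked with a few lines of arithmetic. The sketch below mirrors Event.msg as a C++ struct (builtin_interfaces/Time is an int32 `sec` plus a uint32 `nanosec`); the struct and helper names are made up for illustration, and the in-memory `sizeof` is only a rough proxy for what a given serializer puts on the wire:

```cpp
#include <cstddef>
#include <cstdint>

// Mirrors builtin_interfaces/Time: int32 sec + uint32 nanosec.
struct Time
{
  std::int32_t sec;
  std::uint32_t nanosec;
};

// Mirrors Event.msg. sizeof(Event) is typically 16 on common 64-bit
// ABIs because of trailing padding after the bool.
struct Event
{
  std::uint16_t x;
  std::uint16_t y;
  Time ts;
  bool polarity;
};

// Sum of the field sizes with no padding at all: 2 + 2 + 8 + 1.
constexpr std::size_t kPackedEventBytes =
  2 * sizeof(std::uint16_t) + sizeof(std::int32_t) + sizeof(std::uint32_t) + 1;

// Payload bytes for one message carrying events_per_msg events.
constexpr double message_bytes(std::size_t events_per_msg, std::size_t bytes_per_event)
{
  return static_cast<double>(events_per_msg) * static_cast<double>(bytes_per_event);
}

// Payload bandwidth in bytes per second at the given message rate.
constexpr double bandwidth_bytes_per_s(
  std::size_t events_per_msg, std::size_t bytes_per_event, double rate_hz)
{
  return message_bytes(events_per_msg, bytes_per_event) * rate_hz;
}
```

With 13 bytes per event, 50,000 events come to 650,000 bytes, which matches the 0.650 MB quoted for ROS1 below. With 16 bytes per event the same packet is 800,000 bytes, or 800 MB/s at 1000 Hz.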

ROS1

  • publication rate: 1000Hz
    No problem: can publish messages with 50,000 events (0.650MB) at 1000Hz
  • rostopic bw shows 650MB/s bandwidth
  • rostopic hz shows 1000Hz
    In fact, I can send up to 100,000 events per message (the next gen sensor will require 150k per message!), hit about 1.3GB/s in bandwidth, and ROS1 is still ok for both publisher and receiver.

ROS2 (with Cyclone DDS):

  • publication rate: 31 Hz! Instead of 1000Hz. The call to publish() takes forever to return. This is without even any subscriber for the topic! Why so slow? Beats me.
  • monitor the bandwidth with ros2 topic bw: 27 MB/s, message size 0.8MB.
  • ros2 topic hz: reports 3Hz. The same thing on ROS1 would report 1000Hz, just for comparison.

ROS2 (with fastrtps DDS):

  • publication rate: 950Hz, instead of ROS1’s 1000Hz. The node is running at 95% CPU, so apparently just calling publish() without any subscribers attached involves significant overhead.
  • bandwidth: 550 MB/s, but the moment ros2 topic bw is run, the publication rate drops to 650Hz
  • ros2 topic hz: reports 3Hz, just like cyclonedds.

When I ran into similar issues a little more than a year ago I shrugged them off as “well, ROS2 may just not be quite ready for prime time, it’ll be sorted out”. Maybe I was just doing something wrong? But now it’s a year later and still I’m seeing absolutely wonky behavior when using some very basic ROS2 functionality. Apparently I’m not the only one.

For me this is a real show stopper for ROS2 adoption. I’m not saying this because I don’t like ROS, quite the opposite: the success of ROS hinges on that of ROS2, and for ROS2 to succeed, such blatant performance issues need to be sorted out urgently. Preferably in a way that the developer/user does not have to become an RMW expert and tune a dozen parameters to get it to perform. ROS1 is exemplary when it comes to this (see above).

Lastly, I’m somewhat surprised that the complaints about rmw performance weirdness are not more massive. Am I among the few who just don’t get how to write ROS2 code? Or is everybody else just too nice to bring this up?

15 Likes

Hi Bernd,

Thank you for reporting this. I’m sorry things weren’t working well in this case, even after all of the things you tried. Thank you for taking the time and effort to create the careful writeup and the minimal test case that demonstrates the problem. That helps so much.

I was able to replicate your results, and then spent some time trying other approaches. The following sequence of actions seems to make it work nicely on Cyclone with everything at defaults (Cyclone DDS with its out-of-the-box configuration):

  1. change from “row-major” message alignment to “column-major”. In other words, instead of a list of small messages, try a single message with “columns” for each primitive, like uint16[] x, uint16[] y, etc. The reasoning is the same as between PointCloud and PointCloud2 in ROS: it simplifies things a lot for the serialization and deserialization systems.
  2. increase the maximum and default UDP buffer sizes for both the write and read buffers. This is generally required for DDS, because it hammers UDP very hard when dealing with large and/or fast messages. It’s good you already did this, because performance crashes at high data rates when those buffers fill up.
  3. use a C++ subscriber for the performance measurement rather than the Python-based system provided by the ros2 CLI tool.

With those three steps, it now shows 100% message delivery at the full speed of 1000 Hz for messages up through 20,000 elements on my machine. When sending 5000-element messages, top shows CPU usage of 13% on the publisher and 25% on the subscriber. I am sure that with extra effort to increase fragment sizes and other DDS configuration games, things could get much better, but I didn’t try any of that.
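
For readers unfamiliar with the row-major versus column-major idea in step (1), here is a rough standard-library illustration of the conversion. It is not the code from the PR mentioned below; the struct names are hypothetical, the timestamp is flattened into two columns, and polarity is stored as a byte for simplicity:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// "Row-major": one small struct per event (array of structs).
struct EventRow
{
  std::uint16_t x;
  std::uint16_t y;
  std::int32_t sec;
  std::uint32_t nanosec;
  bool polarity;
};

// "Column-major": one contiguous array per primitive field
// (struct of arrays), similar to uint16[] x, uint16[] y, ... in a .msg.
struct EventColumns
{
  std::vector<std::uint16_t> x;
  std::vector<std::uint16_t> y;
  std::vector<std::int32_t> sec;
  std::vector<std::uint32_t> nanosec;
  std::vector<std::uint8_t> polarity;
};

inline EventColumns to_columns(const std::vector<EventRow> & rows)
{
  EventColumns c;
  c.x.reserve(rows.size());
  c.y.reserve(rows.size());
  c.sec.reserve(rows.size());
  c.nanosec.reserve(rows.size());
  c.polarity.reserve(rows.size());
  for (const EventRow & r : rows) {
    c.x.push_back(r.x);
    c.y.push_back(r.y);
    c.sec.push_back(r.sec);
    c.nanosec.push_back(r.nanosec);
    c.polarity.push_back(r.polarity ? 1 : 0);
  }
  return c;
}
```

Each column is a flat array of one primitive type, so a serializer can in principle copy it in bulk instead of walking 50,000 nested structs one by one.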

You could also try using fixed-size arrays instead of dynamic-length arrays inside the message. Although that would make the downstream code a bit more complex to ignore the “unused” part of the buffer, some RMW implementations will then be able to use additional techniques to speed things up.
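
The extra downstream complexity is modest. A sketch of what a fixed-capacity packet with a "valid count" could look like on the C++ side; `kMaxEvents`, `FixedEventPacket` and the helpers are hypothetical, and only two columns are shown to keep it short:

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kMaxEvents = 1024;  // hypothetical fixed capacity

struct FixedEventPacket
{
  std::uint32_t count = 0;  // number of valid entries at the front
  std::array<std::uint16_t, kMaxEvents> x{};
  std::array<std::uint16_t, kMaxEvents> y{};
};

// Producer side: returns false when the packet is already full.
inline bool push_event(FixedEventPacket & p, std::uint16_t x, std::uint16_t y)
{
  if (p.count >= kMaxEvents) {
    return false;
  }
  p.x[p.count] = x;
  p.y[p.count] = y;
  ++p.count;
  return true;
}

// Consumer side: visit only the valid entries, ignoring the unused tail.
template<typename Fn>
void for_each_event(const FixedEventPacket & p, Fn && fn)
{
  const std::size_t n = std::min<std::size_t>(p.count, kMaxEvents);
  for (std::size_t i = 0; i < n; ++i) {
    fn(p.x[i], p.y[i]);
  }
}
```

The trade-off is that every message has the full fixed size even when it is mostly empty, so the capacity has to be chosen with the expected event rate in mind.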

You could also try to compose the driver node in the same process as whatever will receive the messages and process them, because then you can bypass a lot of comms layers.

I’ll make a PR to your test-case repository with my hacky implementation of the ideas of (1) and (3) above.

Cheers!
Morgan

12 Likes

Hi Bernd,

Your post caused quite a stir here!

One thing that might help is this tool that I just wrote:

It requires ROS 2 Galactic because it uses the new generic_subscriber to receive messages without deserialising them. This will, I think, allow it to measure topic bandwidth with minimal overhead. (If it can be improved, please do send in suggestions!) It should provide more efficient bandwidth measurement than the Python-based ros2 topic bw tool, which could be useful. Maybe.

5 Likes