ROS2 latency using different node setups

Hi there,

As mentioned in this post, we recently submitted a paper to ICRA focusing on the latency introduced by ROS2. We profile the latency to pinpoint bottlenecks and to find potential improvements. Several parameters are varied, e.g.

  • publisher frequency
  • number of nodes
  • DDS middleware
  • msg size

Have a closer look at the preprint: https://arxiv.org/pdf/2101.02074.pdf

Cheers,
urczf

12 Likes

This is some really great work and we appreciate third parties digging in to evaluate various ROS RMWs. I wanted to also point people to the TSC RMW Selection Report Open Robotics wrote a few months back. Skimming the conclusion, it looks like we came to fairly similar conclusions, but this paper is more eloquent in describing the cause of the issues.

@urczf do you have any plans to open source the code or data from this report?

4 Likes

Thanks for the kudos! The source code can be found on GitHub. Regarding the data, I will get back to you.

I will give the RMW selection report a read in the next few days. Nice to hear that we came to similar conclusions.

1 Like

As promised, the (unprocessed) data can be found here.

Please note that we dumped basically everything to ease understanding of the data. Please refer to the repository for the meaning of the respective log files. I will also update the README with the dataset.

1 Like

Hi @urczf ,

As you can see in the study pointed to by @Katherine_Scott, Fast DDS is the best implementation in terms of latency when you set it up for sync publishing.

Until now, ROS2 has required asynchronous publishing, and Fast DDS is configured that way for ROS2. CycloneDDS does not support this mode and is synchronous only, which favors lower latency.

See here for more information: RMW Fast DDS Publication Mode: Sync vs Async and how to change it

I think your study should take this into account.

1 Like

Hi @Jaime_Martin_Losa ,

Yes, I realized that. We deliberately chose the default settings of the RMW. Nonetheless, we will run further evaluations for sync publishing this week and provide the results here. Thanks for letting us know. @joespeed This might also be interesting for you.

Tobias

Hi @urczf,

thanks for posting the paper here. It was very interesting to read. I collected some feedback from the Apex.AI team which I hope will be helpful to further improve the paper. In the following, you will find some comments for each section of the paper. Also, feel free to schedule a meeting to discuss this topic further.


I. INTRODUCTION

  1. “The middleware is an implementation of the DDS standard [7] that is widely used for distributed, real-time systems. Without changing much of the usercode of ROS1, the goal was to hide the DDS middleware and its API to the ROS2 user as shown in Fig. 1.”
    • Here it would be good to point out that the ROS 2 rmw is not made to abstract only DDS but also other middlewares; eCAL and iceoryx are examples.

II. RELATED WORK

III. METHODOLOGY

B. Parameter Space

  1. “Assuming that the sensor reading node only reads the sensor data and immediately publishes the message, this corresponds to the publisher frequency.”
    • For a more realistic scenario the data assembly and processing should be taken into account.

C. Measurement Metrics

  1. “As statistical quantity, we choose median as it is also used by [16] and resilient to outliers”
    1. It could be argued this might not be enough if you are targeting real-time systems.
    2. It would be more accurate to use the 90%, 99%, 99.99%, and max metrics, or, even better, follow the approach proposed in Performance test results analysis and interpretation · Issue #9 · ros-realtime/community · GitHub
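For illustration only (this snippet is ours, not from the paper or the linked issue), such tail metrics could be computed from the recorded latency samples roughly like this:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Nearest-rank quantile of the latency samples (0 <= q <= 1), in the same unit
// as the input (e.g. microseconds). Works on a sorted copy; assumes non-empty input.
double latency_quantile(std::vector<double> samples, double q)
{
  std::sort(samples.begin(), samples.end());
  const std::size_t idx = static_cast<std::size_t>(q * (samples.size() - 1) + 0.5);
  return samples[idx];
}

// Usage: report the median together with the tail, e.g.
//   latency_quantile(latencies_us, 0.5), latency_quantile(latencies_us, 0.9),
//   latency_quantile(latencies_us, 0.99), latency_quantile(latencies_us, 0.9999),
//   *std::max_element(latencies_us.begin(), latencies_us.end())
```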

D. Evaluation Framework

  1. “from a ROS2 perspective as opposed to performance_test that aims at fine-tuning QoS parameters and the DDS middleware”
    • This is not the aim of performance_test. Please refer to the “Performance Testing in ROS 2” paper, specifically to the “How to properly architect a performance test tool” section. performance_test is also not DDS-specific.

IV. EVALUATION

  1. The evaluation setup is not realistic enough in most cases (e.g., for an autonomous car setup). One of the main problems is that you cannot assume the data is published right after the sensor data is ready. The data assembly stage is missing, which has an important impact on the end-to-end latency.
  2. At a high level, the comparison between different middlewares is not an A-to-A comparison, for the following reasons:
    1. The RMW implementations for Fast RTPS and CycloneDDS map a DDS domain participant to an rclcpp context, whereas the RMW implementation for Connext Pro maps a DDS domain participant to a ROS2 node.
    2. Also, since you use localhost and configure the setup to use UDP as the transport layer, CycloneDDS will bypass the entire network stack for the intra-process communication, i.e. it is not handled as a transport. (These differences probably also correlate with the performance results.)
  3. If your main focus is to measure the latency of real-time critical systems, you should be using a properly configured real-time kernel.
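As an illustration (ours, not from the paper), “properly configured” on a PREEMPT_RT kernel typically means locking memory and running the measurement threads with a real-time scheduling class; the function name and priority below are just examples:

```cpp
#include <cerrno>
#include <cstring>
#include <iostream>
#include <pthread.h>
#include <sched.h>
#include <sys/mman.h>

// Lock all current and future pages into RAM and switch the calling thread to
// SCHED_FIFO. Requires appropriate privileges (e.g. CAP_SYS_NICE / rtprio limits).
bool configure_realtime(int priority = 80)
{
  if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {  // avoid page faults during the measurement
    std::cerr << "mlockall failed: " << std::strerror(errno) << "\n";
    return false;
  }
  sched_param param{};
  param.sched_priority = priority;
  if (pthread_setschedparam(pthread_self(), SCHED_FIFO, &param) != 0) {
    std::cerr << "pthread_setschedparam failed\n";
    return false;
  }
  return true;
}
```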

A. Evaluation of Publisher Frequency

  1. “Further, we can see that the obtained latencies for 100 B, 1 KB, 10 KB are equal, but increase for 100 KB and 500 KB”
    1. This is eliminated if you use a different transport protocol, e.g. zero-copy.

B. Evaluation of Scalability

  1. In our experience, using zero-copy greatly mitigates the dependency between the number of nodes and the latency. It would be interesting to compare the scalability results using a zero-copy transport.
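The sketch below is only meant to show the rclcpp loaned-message API such a comparison would build on; topic name and message type are illustrative, and the underlying rmw must actually support message loans (e.g. rmw_iceoryx with a fixed-size type), otherwise rclcpp falls back to a regular allocation:

```cpp
#include <utility>

#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/float64.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("zero_copy_talker");
  auto pub = node->create_publisher<std_msgs::msg::Float64>("data", 10);

  // Ask the middleware for a sample; with a zero-copy capable rmw this memory
  // lives in shared memory and is not copied or serialized on publish.
  auto loaned_msg = pub->borrow_loaned_message();
  loaned_msg.get().data = 42.0;
  pub->publish(std::move(loaned_msg));

  rclcpp::shutdown();
  return 0;
}
```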

C. Profiling

  1. “However, the overhead of ROS2 compared to raw DDS amounts up to 50 % for small message”
    1. This overhead can be greatly reduced by adding a few optimizations. As an example, Apex.AI uses an optimized rmw implementation with “Zero overhead type conversion” and “PollingSubscription”, as described in the “Impacts on performance in ROS 2” section of “Performance Testing in ROS 2”. We are looking to improve the rmw implementations in the open, so most of the overhead is very likely to be removed in the future.
  2. “rclcpp notification delay” => this is probably related to the scheduling mechanism of the executor. Using a waitset instead of an executor would remove this contribution to the latency overhead.

D. Influence of QoS Reliability

  1. “we cannot simulate a lossy network with the available parameter sets”. It is possible to simulate a lossy network without using remote machines, for example using tc => Use quality-of-service settings to handle lossy networks

V. CONCLUSION AND FUTURE OUTLOOK

  1. “higher the frequency, the lower the latency” => This conclusion could be misleading without the proper context, because this is not a direct effect of ROS or DDS but some unknown effect probably coming from the OS and/or the NIC.
7 Likes

Nice paper. I do have a question about the transport latency (LTE, WiFi), since it will be a significant part of the communication latency, and any kind of measurement harness needs to account for some part of that as well.

Specifically with DDS and multicast there can be quite a lot of adverse effects with WiFi.

@HackToHell see 2020 ROS Middleware Evaluation Report | TSC-RMW-Reports.

Otherwise, yes, there are issues with DDS and multicast, as elaborated here: ROS2 Default Behavior (Wifi). But we are working on it (ROS2 Default Behavior (Wifi) - #9 by Jaime_Martin_Losa), and there are tunable approaches (ROS2 Default Behavior (Wifi) - #15 by Jaime_Martin_Losa).

Hi @carlossv ,

Thanks for your feedback! Regarding a meeting, I will send you a message. Below you can find our comments.


I. INTRODUCTION

  1. “The middleware is an implementation of the DDS standard [7] that is widely used for distributed, real-time systems. Without changing much of the usercode of ROS1, the goal was to hide the DDS middleware and its API to the ROS2 user as shown in Fig. 1.”
  • Here it would be good to point out that the ROS 2 rmw is not made to abstract only DDS but also other middlewares; eCAL and iceoryx are examples.

Great to point this out. We didn’t know that. When we studied the available design documents, tutorials, and code base of ROS2, the only middleware mentioned was DDS. See the design docs for example: “For ROS 2 the decision has been made to build it on top of an existing middleware solution (namely DDS).”

Do you have a link to any official documentation regarding this point?

II. RELATED WORK

Thanks for bringing up more relevant literature.
As for your first suggestion, it helps us to classify our test approach as “grey-box”, or rather mixed, since we treat the OS and HW layer as a black box.
That being said, the linked document is a white paper which is not (AFAIK) available as a submitted conference or journal paper. This might be the reason it slipped through.
The evaluation report was published after we submitted our paper.

III. METHODOLOGY

The transport latency is not influenced by the processing time. Any processing time just appears on top of the transport latency. Also, the processing is very application-specific, whereas the transport is generic and occurs in every system equally.

Sure. That’s what we propose to do in future work.

D. EVALUATION FRAMEWORK

I agree. We were unspecific at this point. We wanted to point out that performance_test rather focuses on the communication architecture between subscriber/publisher instead of the node setup. Note the word “rather” :wink:

IV. EVALUATION

Sure, end-to-end latency is influenced by the data assembly as well. However, as mentioned before, the assembly is independent of the transport latency and just adds to the overall latency. As such, we arguably present a lower bound of the end-to-end latency in a real system.

From a user perspective we did an A-to-A comparison: we used the default middlewares and compared their performance. The user doesn’t care whether it’s mapped to domain participants or contexts. Even if they cared, they couldn’t easily change it for each middleware. As such, in my eyes, the comparison is reasonable.

Regarding Cyclone and intra-process, I’m not sure if that is true: we specifically used inter-process communication, i.e. the data went through the network stack. We did not allow ROS2 intra-process comms. This is also visible in the profiling: we always went down to dds_write and dds_take. However, it might be that a middleware detects that the subscriber is running in the same process as the publisher and then transparently does not use UDP. But, given that there is a big jump in the latency with packets bigger than 64k (the max UDP packet size), this is improbable.

Yes, a real-time kernel could have been used as well.

A. Evaluation of Publisher Frequency

Zero-copy is only possible for intra-process communication. If you go over the network, you cannot do zero-copy; the data is chopped into UDP-packet-sized fragments.

B. EVALUATION OF SCALABILITY

Again, zero-copy only works intra-process, which we did not care too much about. We focused on the latency of inter-process communication, to be able to split the system into independently operating nodes.

C. PROFILING

Regarding the custom MW: We weren’t aware of that. We would be interested in the rmw!
Regarding the notification delay: we used the rmw implementations as available. Except for fastrtps, they all use a waitset to notify the system that a message has arrived.

D. QoS

Good point. We will have a look at that.

V. CONCLUSION AND FUTURE OUTLOOK

Yep, this should be clearly stated, although the reasons for this “rule of thumb” are explored and explained in the results section.

2 Likes

Hi everyone,

I just wanted to leave the latest latency performance evaluation we did on Fast DDS v2.2.0 here in case someone is interested. I think that it constitutes a good example of why just testing out-of-the-box performance might fall short for a large number of use cases.

Of course, I think it is really important to improve the out-of-the-box experience as much as possible, and in that sense, knowing what the default configurations can do is vital. That being said, I think that out-of-the-box performance as it is right now is good enough for a lot of use cases (if not most of them). I also believe that for those for which it is not, we should provide an evaluation of what the different options out there can do, since people leaning towards high performance configurations are already predisposed to make as many design choices and tweaks as necessary.

We are more than happy to help as much as we can with the design and interpretation of these kinds of results. In fact, the ROS 2 Real-Time Working Group is already having discussions about how to tackle these issues; you can follow them in this ticket. In my opinion, we still lack a complete and fair performance evaluation of ROS 2, so maybe we can all concentrate our efforts towards creating a very complete platform that gives relevant insight to all kinds of ROS 2 users and projects, given that we are already invested in doing so.

2 Likes

Thanks for continuing the discussion here. Here is our feedback for the latest reply.

Great to point this out. We didn’t know that. When we studied the available design documents, tutorials, and code base of ROS2, the only middleware mentioned was DDS. See the design docs for example: “For ROS 2 the decision has been made to build it on top of an existing middleware solution (namely DDS).”

Do you have a link to any official documentation regarding this point?

Yes, there is some information in the same document about this. See:

While ROS 2 only aims to support DDS based middleware implementations it can strive to keep the middleware interface free of DDS specific concepts to enable implementations of the interface using a different middleware.

Also, it is planned to add a new document explaining how to implement new RMWs. See “Documentation for implementing new RMWs”.

Sure, end-to-end latency is influenced by the data assembly as well. However, as mentioned before, the assembly is independent of the transport latency and just adds to the overall latency. As such, we arguably present a lower bound of the end-to-end latency in a real system.

Here we would suggest making a clear separation between a realistic use case and a completely use-case-agnostic performance evaluation.

From a user perspective we did an A-to-A comparison: we used the default middlewares and compared their performance. The user doesn’t care whether it’s mapped to domain participants or contexts. Even if they cared, they couldn’t easily change it for each middleware. As such, in my eyes, the comparison is reasonable.

We think the user should be informed about these details. For example, in the paper you explain some of the reasons why rmw_connext_cpp has such poor results and that it has room for improvement. This information is relevant to the user. In the same way, it would be good to inform the user about critical differences when comparing the performance.

Regarding Cyclone and intra-process, I’m not sure if that is true: we specifically used inter-process communication, i.e. the data went through the network stack. We did not allow ROS2 intra-process comms. This is also visible in the profiling: we always went down to dds_write and dds_take. However, it might be that a middleware detects that the subscriber is running in the same process as the publisher and then transparently does not use UDP. But, given that there is a big jump in the latency with packets bigger than 64k (the max UDP packet size), this is improbable.

Since ROS2 intra-process communication is disabled, ROS2 publishes the message, which essentially calls the DDS write(); it is then up to the DDS implementation how these samples are delivered to the readers. In CycloneDDS, deliver_locally() (https://github.com/eclipse-cyclonedds/cyclonedds/blob/b84c035ce0292cd76358475f069da4aec1540bb2/src/core/ddsc/src/dds_write.c#L162) will not use the network stack to deliver the samples to local readers.

You could verify this by using ros2 tracing and checking that the network stack syscalls are not called.

But, given that there is a big jump in the latency with packets bigger than 64k (the max UDP packet size), this is improbable.

This is probably because of the max message size configured here: https://github.com/Barkhausen-Institut/ros2_latency_evaluation/blob/master/config/qos_cyclonedds.xml

Zero-copy is only possible for intra-process communication. If you go over the network, you cannot do zero-copy; the data is chopped into UDP-packet-sized fragments.

Zero-Copy can be used for inter-process communication. Please see e.g. the iceoryx talk from ROSCon 2019: https://roscon.ros.org/2019/talks/roscon2019_truezerocopy.pdf. It would be interesting to compare your results with a zero-copy middleware for intra-computer communication. Profiling inter-computer communication would be a whole different story as you also need to account for the parts below the protocol (data link, physical layer, switches).

Regarding the custom MW: We weren’t aware of that. We would be interested in the rmw! Regarding the notification delay: we used the rmw implementations as available. Except for fastrtps, they all use a waitset to notify the system that a message has arrived.

Here we mean the rclcpp waitset. You can find more details here.

Here is a short explanation of the differences between the ‘Callback + Executor’, rclcpp::WaitSet, and Apex.OS PollingSubscription approaches. When using the rclcpp executor, each subscription is automatically registered to the DDS wait-set without giving the developer control over the behavior; therefore, every time a sample on any topic is received, a callback is executed. When using the Apex.OS PollingSubscription with an rclcpp wait-set, data must be explicitly requested from the PollingSubscription when needed.
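For readers who have not used it, here is a minimal rclcpp::WaitSet sketch (our illustration; the Apex.OS PollingSubscription itself is proprietary and not shown). Instead of handing the node to an executor that dispatches callbacks, the application waits on the subscription and takes the data explicitly; topic name and message type are placeholders:

```cpp
#include <chrono>

#include "rclcpp/rclcpp.hpp"
#include "rclcpp/wait_set.hpp"
#include "std_msgs/msg/string.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("waitset_listener");

  // The callback is never invoked because the node is not spun by an executor;
  // data is pulled explicitly via take() after the wait set wakes up.
  auto do_nothing = [](std_msgs::msg::String::UniquePtr) {};
  auto sub = node->create_subscription<std_msgs::msg::String>("chatter", 10, do_nothing);

  rclcpp::WaitSet wait_set;
  wait_set.add_subscription(sub);

  while (rclcpp::ok()) {
    auto result = wait_set.wait(std::chrono::milliseconds(100));
    if (result.kind() == rclcpp::WaitResultKind::Ready) {
      std_msgs::msg::String msg;
      rclcpp::MessageInfo info;
      if (sub->take(msg, info)) {
        RCLCPP_INFO(node->get_logger(), "received: '%s'", msg.data.c_str());
      }
    }
  }

  rclcpp::shutdown();
  return 0;
}
```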

In order to show some performance measurements for each case, here are some results we obtained recently comparing the following setups:

  1. ROS 2 executor+callback + Fast-RTPS (Dashing)
  2. Apex.OS Waitset+Polling Subscription using RTI Connext Micro
  3. RTI Connext Micro plugin

These are the results for each case:

  • Array2m, 1 publisher, 10 subscribers, 100 Hz
    • ROS 2, Callback
      • inter: 17ms - 23ms latency, 25% - 30% cpu
      • intra: 55ms - 65ms latency, 36% - 38% cpu
    • Apex.OS, Waitset
      • inter: 5ms - 7ms latency, 4% - 5% cpu
      • intra: 5ms - 7ms latency, 5% - 7% cpu
    • RTI Connext Micro plugin.
      • inter: 4ms - 6ms latency, 3% - 5% cpu
      • intra: 4ms - 6ms latency, 3% - 6% cpu
  • Array2m, 1 publisher, 1 subscriber, 500 Hz
    • ROS 2, Callback
      • inter: 1.3ms - 2.8ms latency, 3.1% - 4.8% cpu
      • intra: 1.8ms - 3.4ms latency, 8.3% - 12.2% cpu
    • Apex.OS, Waitset
      • inter: 0.7ms - 1.6ms latency, 3.1% - 4.9% cpu
      • intra: 1.1ms - 1.5ms latency, 3.0% - 5.2% cpu
    • RTI Connext Micro plugin
      • inter: 1.7ms - 2.4ms latency, 3.5% - 5.1% cpu
      • intra: 0.9ms - 1.1ms latency, 3.8% - 4.8% cpu
1 Like

@carlossv Thank you for mentioning eCAL and iceoryx as alternative ROS2 rmw’s. We currently run ROS2 on top of rmw_ecal, and even though eCAL (with or without the iceoryx binding) is quite fast for intra-process or inter-process communication, we lose a lot of performance in the rmw boilerplate for transforming the ROS message types into an rmw-internal representation. So our focus moved away from optimizing the low-level zero (or almost zero ;-)) copy mechanism to a better implementation of the type support (currently we have a dynamic one, and the implementation is probably not perfect …).

Therefore, we will soon open source a static type support implementation based on Google Protobuf, so that we can use the power of Google’s message format on the low-level transport layer. eCAL has some nice tooling (like decentralized recording with multiple distributed clients) that can be used as an alternative (or in addition) to rosbag2, and if the recorded content is based on the Google Protobuf message format, lots of options are available for postprocessing.

It would be great if our small contribution to the ROS2 ecosystem could be used in some specific scenarios (like AD vehicle development) and make life easier :slight_smile:

2 Likes

Hi all,

I’m posting here the minutes from our meeting with @urczf and his colleagues.


  • It would be interesting to compare the results in a real-time system using suitable settings and measure additional metrics. => This will be part of future work.
  • We discussed putting into context some statements regarding current ROS 2 performance for time-critical systems. For example, the results show that ROS 2 has 50% overhead with respect to DDS. This could discourage new users from using ROS 2 if they don’t have the proper context. It is true for the default configuration and rmw layers, but it could be greatly optimized in the future, for example by optimizing the rmw implementation or using other approaches such as the rclcpp waitset.
  • The paper’s goal was to test the performance using the ROS 2 defaults. It can be argued that the ROS 2 default implementation and settings are not currently suitable for time-critical applications; some additional optimizations and settings would be required to achieve this.
  • In future work it would be interesting to compare the current results with:
    • rclcpp waitset
    • zero-copy with iceoryx
    • newest DDS versions

Thanks again @urczf for the discussion, looking forward to reading your future work!

1 Like

ROS 2 adds both performance overhead on top of DDS and incompatibility issues, as messages from one ROS 2 distribution are not compatible with messages from other ROS 2 distributions.

Is it possible to create hybrid systems with both ROS 2 nodes and nodes that just use DDS directly?
Can DDS messages and topics be mapped back and forth between ROS 2 messages and topics for each distribution and be connected together so they can communicate properly?

Then the nodes that need better performance, or need to be able to communicate with more than one ROS 2 distribution, could have the option of not including the additional ROS overhead.

An example might be an encoder board that needs to do real time processing and just wants to publish position values out to the world as simple DDS messages so ROS 2 nodes from various distributions can read them.

It is slightly more nuanced than that.

We certainly don’t test or guarantee that one distribution can talk to the next one. That is one way we can make changes to the messages over time, by making the changes to the “next” distribution.

But it is also the case that the messages in the core are fairly stable over time, and so things typically work release-to-release. Also, we typically will review/take fixes that fix interoperability problems. But nothing is guaranteed there.

In short, yes, with some limitations right now. I’ve been working on this recently; I have example code for both Fast-DDS and CycloneDDS at GitHub - osrf/ros2_raw_dds_example: A project showing how to connect a raw DDS program to a ROS 2 graph.

The main limitations at the moment are that our open-source implementations don’t support DDS keyed data and don’t support DDS optional data. If you avoid both of those in your “raw” DDS participant, you should be able to talk to ROS 2 just fine.

That is great, thanks!

So in a hybrid ROS 2 and pure DDS system, what message format would you recommend using as the common data model for people to agree upon when exchanging information between nodes: the DDS message, the ROS message, or the IDL?

Is the DDS message the most general format, as it can be made to work in pure ROS systems, pure DDS systems, or a hybrid? Can the IDL and the ROS message be generated from the DDS message, as long as you do not use keyed or optional data?

This sentence is a bit unclear.

Is the limitation in the available / used RMWs, or in the OSS version of the project you share?

That’s a good question, that sentence was unclear. As far as I understand, the backend Fast-DDS and CycloneDDS libraries support Keyed data just fine. That particular limitation is in the RMW implementations (rmw_fastrtps_cpp and rmw_cyclonedds_cpp, respectively).

The situation is less clear to me for Optional data. In my tests so far I have just been avoiding it, but one of my goals here is to understand where these limitations are.

It basically has to be the IDL, since that is the only thing that the “raw” DDS implementations understand. The good news is that this is how ROS 2 messages already work: we go from the .msg to an .idl file, and then generate code from that. So the IDL is available for all ROS 2 messages.

Hi guys,

Curiously, we were looking for a way to record DDS topics using rosbag2 the other day, which entails making DDS and ROS 2 talk to each other. This is fairly simple; it basically consists of four steps:

  1. Generate a message type using the ROS 2 format (.msg).
  2. Convert the .msg to IDL with rosidl_adapter.
  3. Generate a ROS 2 compatible type from the IDL using Fast DDS-Gen.
  4. Name the topic following the ROS 2 topic naming conventions.

I’ll leave the instructions here in case you guys find them relevant.
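To make step 4 a little more concrete, here is a rough C++ sketch (ours, for illustration) of what a raw Fast DDS publisher produced by these steps might look like. The HelloWorld type, its generated HelloWorldPubSubType, and the index() field stand in for whatever Fast DDS-Gen produced in step 3; the exact registered type name must match what the ROS 2 rmw expects. The topic name follows the ROS 2 convention of prefixing regular topics with “rt/” on the DDS level, and the participant is created on the default ROS 2 domain 0:

```cpp
#include <fastdds/dds/domain/DomainParticipant.hpp>
#include <fastdds/dds/domain/DomainParticipantFactory.hpp>
#include <fastdds/dds/publisher/DataWriter.hpp>
#include <fastdds/dds/publisher/Publisher.hpp>
#include <fastdds/dds/topic/Topic.hpp>
#include <fastdds/dds/topic/TypeSupport.hpp>

#include "HelloWorldPubSubTypes.h"  // generated by Fast DDS-Gen in step 3 (assumed name)

using namespace eprosima::fastdds::dds;

int main()
{
  // ROS 2 uses DDS domain 0 by default (ROS_DOMAIN_ID).
  DomainParticipant * participant =
    DomainParticipantFactory::get_instance()->create_participant(0, PARTICIPANT_QOS_DEFAULT);

  // Register the type generated from the ROS 2-derived IDL.
  TypeSupport type(new HelloWorldPubSubType());
  type.register_type(participant);

  // A ROS 2 topic "/chatter" appears on the DDS level as "rt/chatter".
  Topic * topic =
    participant->create_topic("rt/chatter", type.get_type_name(), TOPIC_QOS_DEFAULT);

  Publisher * publisher = participant->create_publisher(PUBLISHER_QOS_DEFAULT);
  DataWriter * writer = publisher->create_datawriter(topic, DATAWRITER_QOS_DEFAULT);

  HelloWorld sample;   // hypothetical generated type
  sample.index(1);     // hypothetical field setter
  writer->write(&sample);

  // Entity cleanup omitted for brevity.
  return 0;
}
```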