As announced in this post, we recently submitted a paper to ICRA focusing on the latency introduced by ROS 2. We profile the latency to pinpoint the bottlenecks and to find potential improvements. Different parameters are varied, e.g.
This is some really great work, and we appreciate third parties digging in to evaluate the various ROS RMWs. I also wanted to point people to the TSC RMW Selection Report that Open Robotics wrote a few months back. Skimming the conclusion, it looks like we came to fairly similar conclusions, but this paper is more eloquent in describing the cause of the issues.
@urczf do you have any plans to open source the code or data from this report?
As promised, the (unprocessed) data can be found here.
Please note, we dumped basically everything to ease understanding of the data. Please refer to the repository to understand the meaning of the respective log files. I will update the README with the dataset as well.
As you can see in the study pointed out by @Katherine_Scott, Fast DDS is the best implementation in terms of latency when you set it up for synchronous publishing.
Until now, ROS 2 has required asynchronous publishing, and Fast DDS is configured that way for ROS 2. CycloneDDS does not support this mode and only publishes synchronously, which favors better latency.
Yes, I realized that. We deliberately chose the default settings of each RMW. Nonetheless, we will run further evaluations this week with synchronous publishing and provide the results here. Thanks for letting us know. @joespeed this might also be interesting for you.
Thanks for posting the paper here. It was very interesting to read. I collected some feedback from the Apex.AI team which I hope can be helpful to further improve the paper. In the following, you will find some comments for each paper section. Also, feel free to schedule a meeting to further discuss this topic.
I. INTRODUCTION
"The middleware is an implementation of the DDS standard [7] that is widely used for distributed, real-time systems. Without changing much of the user code of ROS1, the goal was to hide the DDS middleware and its API to the ROS2 user as shown in Fig. 1."
Here it would be good to point out that the ROS 2 rmw layer is not made to abstract only DDS but also other middlewares; eCAL and iceoryx are examples.
II. RELATED WORK
Here it would be good to cite the following related work:
"Assuming that the sensor reading node only reads the sensor data and immediately publishes the message, this corresponds to the publisher frequency."
For a more realistic scenario, the data assembly and processing should be taken into account.
C. Measurement Metrics
"As statistical quantity, we choose median as it is also used by [16] and resilient to outliers"
It could be argued this might not be enough if you are targeting real-time systems.
"from a ROS2 perspective as opposed to performance_test that aims at fine-tuning QoS parameters and the DDS middleware"
This is not the aim of performance_test. Please refer to the "Performance Testing in ROS 2" paper, specifically to "How to properly architect a performance test tool". performance_test is also not DDS-specific.
IV. EVALUATION
The evaluation setup is not realistic enough in most cases (e.g., for an autonomous car setup). One of the main problems is that you cannot assume the data is published just after the sensor data is ready. The data assembly stage is missing, which has an important impact on the end-to-end latency.
At a high level, the comparison between the different middlewares is not an A-to-A comparison, for the following reasons:
The RMW implementations for Fast RTPS and CycloneDDS map a DDS domain participant to an rclcpp context, whereas the RMW implementation for Connext Pro maps a DDS domain participant to a ROS 2 node.
Also, since you use localhost and configure the setup to use UDP as the transport layer, CycloneDDS will bypass the entire network stack for the intra-process communication, so it is not handled as a transport. (These differences are probably also correlated with the performance results.)
If your main focus is to measure the latency of real-time critical systems, you should be using a properly configured real-time kernel.
A. Evaluation of Publisher Frequency
"Further, we can see that the obtained latencies for 100 B, 1 KB, 10 KB are equal, but increase for 100 KB and 500 KB"
This is eliminated if you use a different transport protocol, e.g. zero-copy.
B. Evaluation of Scalability
In our experience, using zero-copy greatly mitigates the dependency between the number of nodes and the latency. It would be interesting to compare the scalability results using a zero-copy transport.
C. Profiling
"However, the overhead of ROS2 compared to raw DDS amounts up to 50 % for small message"
This overhead can be greatly reduced by adding a few optimizations. As an example, Apex.AI uses an optimized rmw implementation with "Zero overhead type conversion" and "PollingSubscription", as described in the "Impacts on performance in ROS 2" section of "Performance Testing in ROS 2". We are looking to improve the rmw implementations in the open, so most of the overhead is very likely to be removed in the future.
"rclcpp notification delay" => this is probably related to the scheduling mechanism of the executor. Using a waitset instead of an executor would remove this contribution to the latency overhead.
D. Influence of QoS Reliability
"we cannot simulate a lossy network with the available parameter sets" => It is possible to simulate a lossy network without using remote machines, for example using tc; see "Use quality-of-service settings to handle lossy networks".
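Since this subsection deals with the reliability QoS, a minimal, untested rclcpp sketch of how the two reliability settings are typically selected might be useful for readers; the node name, topic names, and message type below are placeholders and not from the paper.

```cpp
#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

// Illustrative only: selecting the reliability QoS that this section varies.
int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("qos_demo");

  // RELIABLE: lost samples are retransmitted (the ROS 2 default for topics).
  auto reliable_qos = rclcpp::QoS(rclcpp::KeepLast(10)).reliable();
  auto reliable_pub =
    node->create_publisher<std_msgs::msg::String>("chatter", reliable_qos);

  // BEST_EFFORT: no retransmission, typically lower latency on lossy links.
  auto best_effort_qos = rclcpp::QoS(rclcpp::KeepLast(10)).best_effort();
  auto best_effort_pub =
    node->create_publisher<std_msgs::msg::String>("chatter_be", best_effort_qos);

  rclcpp::shutdown();
  return 0;
}
```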
V. CONCLUSION AND FUTURE OUTLOOK
"higher the frequency, the lower the latency" => This conclusion could be misleading without the proper context, because this is not a direct effect of ROS or DDS but some unknown effect probably coming from the OS and/or the NIC.
Nice paper. I do have a question about the transport latency over LTE and WiFi, since these will be a significant part of the communication latency, and any kind of measurement harness needs to account for some part of that as well.
Specifically with DDS and multicast there can be quite a lot of adverse effects with WiFi.
Thanks for your feedback! Regarding a meeting, I will send you a message. Below, you can see our comments.
I. INTRODUCTION
"The middleware is an implementation of the DDS standard [7] that is widely used for distributed, real-time systems. Without changing much of the user code of ROS1, the goal was to hide the DDS middleware and its API to the ROS2 user as shown in Fig. 1."
Here it would be good to point out that the ROS 2 rmw layer is not made to abstract only DDS but also other middlewares; eCAL and iceoryx are examples.
Great to point this out. We didn't know that. When we studied the available design documents, tutorials, and code base of ROS 2, the only middleware mentioned was DDS. See the design docs, for example: "For ROS 2 the decision has been made to build it on top of an existing middleware solution (namely DDS)."
Do you have a link to any official documentation regarding this point?
II. RELATED WORK
Thanks for bringing up more relevant literature.
As for your first suggestion, it helps us to classify our test approach as "grey-box" or rather mixed, since we treat the OS and HW layer as a black box.
That being said, the linked document is a white paper which is (AFAIK) not available as a submitted conference or journal paper. This might be the reason it slipped through.
The evaluation report was published after we submitted our paper.
III. METHODOLOGY
The transport latency is not influenced by the processing time. Any processing time just appears on top of the transport latency. Also, the processing is very application-specific, whereas the transport is generic and occurs in every system equally.
Sure. That's what we propose to do in future work.
D. EVALUATION FRAMEWORK
I agree. We were unspecific at this point. We wanted to point out that performance_test rather focusses on the communication architecture between subscriber and publisher instead of the node setup. Note the word "rather".
IV. EVALUATION
Sure, end-to-end latency is influenced by the data assembly as well. However, as mentioned before, the assembly is independent of the transport latency and just adds to the overall latency. As such, what we present is perhaps a lower bound of the end-to-end latency in a real system.
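In other words, assuming the application-specific parts simply add on top of the measured transport latency (notation here is only for illustration):

$$
t_{\text{end-to-end}} = t_{\text{assembly}} + t_{\text{transport}} + t_{\text{processing}} \;\ge\; t_{\text{transport}}
$$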
From a user perspective, we did an A-to-A comparison: we used the default middlewares and compared their performance. The user does not care whether it is mapped to domain participants or contexts. Even if they cared, they could not easily change it for each middleware. As such, in my eyes, the comparison is reasonable.
Regarding Cyclone and intra-process, I'm not sure if that is true: we specifically used inter-process communication, i.e. the data went through the network stack. We did not allow ROS 2 intra-process comms. This is also visible in the profiling: we always went down to dds_write and dds_take. However, it might be that a middleware detects that the subscriber is running in the same process as the publisher and then transparently does not use UDP. But, given that there is a big jump in latency for packets bigger than 64 KB (the maximum UDP packet size), this is improbable.
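For completeness, here is a minimal, untested rclcpp sketch of how intra-process communication stays disabled explicitly (the node name is a placeholder; false is also the rclcpp default). This is the configuration under which every message goes through dds_write/dds_take even when publisher and subscriber share a process.

```cpp
#include <memory>
#include <rclcpp/rclcpp.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);

  // Explicitly disable ROS 2 intra-process communication so that messages go
  // through the rmw/DDS layer even within a single process.
  auto options = rclcpp::NodeOptions().use_intra_process_comms(false);
  auto node = std::make_shared<rclcpp::Node>("talker", options);

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}
```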
Yes, a real-time kernel could have been used as well.
A. Evaluation of Publisher Frequency
Zero-copy is only possible for intra-process communication. If you go over the network, you cannot do zero-copy; the data is chopped into UDP-packet-sized fragments.
B. EVALUATION OF SCALABILITY
Again, zero-copy only works for intra-process communication, which we did not care too much about. We focused on the latency of inter-process comms, to be able to split the system into independently operating nodes.
C. PROFILING
Regarding the custom MW: we weren't aware of that. We would be interested in the rmw!
Regarding the notification delay: we used the implementation from rmw as available. Except for fastrtps, they all use a waitset for notifying the system that a message has arrived.
D. QoS
Good point. We will have a look at that.
V. CONCLUSION AND FUTURE OUTLOOK
Yep, this should clearly be stated, although the reasons for this "rule of thumb" are explored and explained in the results section.
I just wanted to leave the latest latency performance evaluation we did on Fast DDS v2.2.0 here in case someone is interested. I think that it constitutes a good example of why just testing out-of-the-box performance might fall short for a large number of use cases.
Of course, I think it is really important to improve the out-of-the-box experience as much as possible, and in that sense, knowing what the default configurations can do is vital. That being said, I think that out-of-the-box performance as it is right now is good enough for a lot of use cases (if not most of them). I also believe that for those for which it is not, we should provide an evaluation of what the different options out there can do, since people leaning towards high performance configurations are already predisposed to make as many design choices and tweaks as necessary.
We are more than happy to help as much as we can with the design and interpretation of these kinds of results. In fact, the ROS 2 Real-Time Working Group is already having discussions about how to tackle these issues; you can follow them on this ticket. In my opinion, we still lack a complete and fair performance evaluation of ROS 2, so maybe we can all concentrate our efforts towards creating a very complete platform that gives relevant insight to all kinds of ROS 2 users and projects, given that we are already invested in doing so.
Thanks for continuing the discussion here. Here is our feedback for the latest reply.
Great to point this out. We didn't know that. When we studied the available design documents, tutorials, and code base of ROS 2, the only middleware mentioned was DDS. See the design docs, for example: "For ROS 2 the decision has been made to build it on top of an existing middleware solution (namely DDS)."
Do you have a link to any official documentation regarding this point?
Yes, there is some information in the same document about this. See:
While ROS 2 only aims to support DDS based middleware implementations it can strive to keep the middleware interface free of DDS specific concepts to enable implementations of the interface using a different middleware.
Also, it is planned to add a new document explaining how to implement new RMWs; see "Documentation for implementing new RMWs".
Sure, end-to-end latency is influenced by the data assembly as well. However, as mentioned before, the assembly is independent of the transport latency and just adds to the overall latency. As such, what we present is perhaps a lower bound of the end-to-end latency in a real system.
Here we would suggest making a clear separation between a realistic use case and a completely use-case-agnostic performance evaluation.
From a user perspective, we did an A-to-A comparison: we used the default middlewares and compared their performance. The user does not care whether it is mapped to domain participants or contexts. Even if they cared, they could not easily change it for each middleware. As such, in my eyes, the comparison is reasonable.
We think the user should be informed about these details. For example, in the paper you explain some of the reasons why rmw_connext_cpp has such poor results and that it has room for improvements. This information is relevant to the user. In the same way, it would be good to inform the user about critical differences when comparing the performance.
Regarding Cyclone and intra-process, I'm not sure if that is true: we specifically used inter-process communication, i.e. the data went through the network stack. We did not allow ROS 2 intra-process comms. This is also visible in the profiling: we always went down to dds_write and dds_take. However, it might be that a middleware detects that the subscriber is running in the same process as the publisher and then transparently does not use UDP. But, given that there is a big jump in latency for packets bigger than 64 KB (the maximum UDP packet size), this is improbable.
Zero-copy is only possible for intra-process communication. If you go over the network, you cannot do zero-copy; the data is chopped into UDP-packet-sized fragments.
Zero-Copy can be used for inter-process communication. Please see e.g. the iceoryx talk from ROSCon 2019: https://roscon.ros.org/2019/talks/roscon2019_truezerocopy.pdf. It would be interesting to compare your results with a zero-copy middleware for intra-computer communication. Profiling inter-computer communication would be a whole different story as you also need to account for the parts below the protocol (data link, physical layer, switches).
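As an illustration of how this looks at the rclcpp level, newer rclcpp versions expose a loaned-message API that shared-memory RMWs (e.g. iceoryx-based ones) can map to true zero-copy. The following is a minimal, untested sketch; the topic name and message type are placeholders, the message type must be fixed-size, and the RMW in use must actually support loaning.

```cpp
#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/float64.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("zero_copy_talker");
  auto pub = node->create_publisher<std_msgs::msg::Float64>("chatter", 10);

  if (pub->can_loan_messages()) {
    // The sample is allocated in middleware-owned memory, filled in place,
    // and handed back without an extra copy.
    auto loaned = pub->borrow_loaned_message();
    loaned.get().data = 42.0;
    pub->publish(std::move(loaned));
  } else {
    // Fall back to a normal publish if the RMW cannot loan messages.
    std_msgs::msg::Float64 msg;
    msg.data = 42.0;
    pub->publish(msg);
  }

  rclcpp::shutdown();
  return 0;
}
```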
Regarding the custom MW: we weren't aware of that. We would be interested in the rmw! Regarding the notification delay: we used the implementation from rmw as available. Except for fastrtps, they all use a waitset for notifying the system that a message has arrived.
Here is a short explanation of the difference between the "callback + executor", rclcpp waitset, and Apex.OS PollingSubscription approaches. When using the rclcpp executor, each subscription is automatically registered to the DDS wait-set without giving the developer control over the behavior. Therefore, every time a sample on any topic is received, a callback is executed. When using the Apex.OS PollingSubscription with an rclcpp wait-set, data must be explicitly requested from the PollingSubscription when needed.
In order to show some performance measurements for each case, here are some results we obtained recently comparing the following cases:
ROS 2 executor+callback + Fast-RTPS (Dashing)
Apex.OS Waitset+Polling Subscription using RTI Connext Micro
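For reference, here is a minimal, untested sketch of the wait-set pattern described above using plain rclcpp (rclcpp::WaitSet plus an explicit take()); this is not the Apex.OS PollingSubscription API, and the topic and message type are placeholders. It assumes an rclcpp version that ships rclcpp::WaitSet.

```cpp
#include <chrono>
#include <rclcpp/rclcpp.hpp>
#include <rclcpp/wait_set.hpp>
#include <std_msgs/msg/string.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("waitset_listener");

  // The callback is never used because the node is not added to an executor;
  // data is taken explicitly after the wait-set wakes up.
  auto sub = node->create_subscription<std_msgs::msg::String>(
    "chatter", 10, [](std_msgs::msg::String::ConstSharedPtr) {});

  rclcpp::WaitSet wait_set;
  wait_set.add_subscription(sub);

  while (rclcpp::ok()) {
    auto result = wait_set.wait(std::chrono::milliseconds(100));
    if (result.kind() == rclcpp::WaitResultKind::Ready) {
      std_msgs::msg::String msg;
      rclcpp::MessageInfo info;
      if (sub->take(msg, info)) {
        RCLCPP_INFO(node->get_logger(), "took: '%s'", msg.data.c_str());
      }
    }
  }

  rclcpp::shutdown();
  return 0;
}
```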
@carlossv Thank you for mentioning eCAL and iceoryx as alternative ROS 2 RMWs. We currently run ROS 2 on top of rmw_ecal, and even though eCAL (with or without the iceoryx binding) is quite fast for intra-process or inter-process communication, we lose a lot of performance in the rmw boilerplate for transforming the ROS message types into an RMW-internal representation. So our focus moved away from optimizing the low-level zero (or almost zero ;-)) copy mechanism to a better implementation of the type support (currently we have a dynamic one, and the implementation is probably not perfect...).
Therefore, we will soon open source a static type support implementation based on Google Protobuf, so that we can use the power of Google's message format on the low-level transport layer. eCAL has some nice tooling (like decentralized recording with multiple distributed clients) that can be used as an alternative (or in addition) to rosbag2, and if the recorded content is based on the Google Protobuf message format, lots of options are available for post-processing.
It would be great if our small contribution to the ROS 2 ecosystem could be used in some specific scenarios (like AD vehicle development) and make life easier.
I'm posting here the minutes from our meeting with @urczf and his colleagues.
It would be interesting to compare the results in a real-time system using suitable settings and measure additional metrics. => This will be part of future work.
We discussed putting into context some statements regarding current ROS 2 performance for time-critical systems. For example, the results show that ROS 2 has 50% overhead with respect to DDS. This could discourage new users from using ROS 2 if they don't have the proper context. This is true for the default configuration and rmw layers, but it could be greatly optimized in the future, for example by optimizing the rmw implementation or using other approaches such as the rclcpp waitset.
The paper's goal was to test the performance using the ROS 2 defaults. It can be argued that the ROS 2 default implementation and settings are not currently suitable for time-critical applications; achieving this would require some additional optimizations and settings.
In future work it would be interesting to compare the current results with:
rclcpp waitset
zero-copy with iceoryx
newest DDS versions
Thanks again @urczf for the discussion, looking forward to reading your future work!
ROS 2 adds both performance overhead and incompatibility issues on top of DDS, as messages from one ROS 2 distribution are not compatible with messages from other ROS 2 distributions.
Is it possible to create hybrid systems with both ROS 2 nodes and nodes that just use DDS directly?
Can DDS messages and topics be mapped back and forth between ROS 2 messages and topics for each distribution and be connected together so they can communicate properly?
Then the nodes that need the better performance or need to be able to communicate with more than one ROS 2 distribution could have the option of not including the additional ROS overhead.
An example might be an encoder board that needs to do real time processing and just wants to publish position values out to the world as simple DDS messages so ROS 2 nodes from various distributions can read them.
We certainly don't test or guarantee that one distribution can talk to the next one. That is one way we can make changes to the messages over time, by making the changes to the "next" distribution.
But it is also the case that the messages in the core are fairly stable over time, and so things typically work release-to-release. Also, we typically will review/take fixes that fix interoperability problems. But nothing is guaranteed there.
The main limitation at the moment is that our open-source implementations don't support DDS keyed data or DDS optional data. If you avoid both of those in your "raw" DDS participant, you should be able to talk to ROS 2 just fine.
So in a hybrid ROS 2 and pure DDS system, what message format would you recommend using as the common data model for people to agree upon when exchanging information between nodes: the DDS message, the ROS message, or the IDL?
Is the DDS message the most general format, as it can be made to work in pure ROS systems, pure DDS systems, or a hybrid? Can the IDL and the ROS message be generated from the DDS message, as long as you do not use keyed or optional data?
That's a good question; that sentence was unclear. As far as I understand, the backend Fast DDS and CycloneDDS libraries support keyed data just fine. That particular limitation is in the RMW implementations (rmw_fastrtps_cpp and rmw_cyclonedds_cpp, respectively).
The situation is less clear to me for Optional data. In my tests so far I have just been avoiding it, but one of my goals here is to understand where these limitations are.
It basically has to be the IDL, since that is the only thing that the "raw" DDS implementations understand. The good news is that the way ROS 2 messages work is that we go from a .msg to a .idl file, and then generate code from that. So the IDL is available for all ROS 2 messages.
Curiously, we were looking for a way to record DDS topics using rosbag2 the other day, which entails making DDS and ROS 2 talk to each other. This is fairly simple; it basically consists of four steps:
Generate a message type using the ROS 2 format (.msg).
Convert the .msg to IDL with rosidl_adapter.
Generate a ROS 2-compatible type from the IDL using Fast DDS-Gen.
Name the topic following the ROS 2 topic naming conventions.
I'll leave the instructions here in case you find them relevant.
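To make the four steps above a bit more concrete, here is a rough, untested sketch of what the plain Fast DDS side could look like once the type has been generated. The header and class names ("StringPubSubTypes.h", std_msgs::msg::dds_::String_PubSubType) are assumptions about what Fast DDS-Gen would produce from the rosidl_adapter IDL and have to be adapted to your generated code; the "rt/" topic prefix and the mangled type name are the ROS 2 naming conventions referred to in step 4.

```cpp
#include <fastdds/dds/domain/DomainParticipant.hpp>
#include <fastdds/dds/domain/DomainParticipantFactory.hpp>
#include <fastdds/dds/publisher/DataWriter.hpp>
#include <fastdds/dds/publisher/Publisher.hpp>
#include <fastdds/dds/topic/Topic.hpp>
#include <fastdds/dds/topic/TypeSupport.hpp>

#include "StringPubSubTypes.h"  // assumed output of Fast DDS-Gen (step 3); adapt to your code

using namespace eprosima::fastdds::dds;

int main()
{
  // ROS 2 uses DDS domain 0 unless ROS_DOMAIN_ID says otherwise.
  DomainParticipant * participant =
    DomainParticipantFactory::get_instance()->create_participant(0, PARTICIPANT_QOS_DEFAULT);

  // The registered type name must match the mangled ROS 2 name,
  // e.g. "std_msgs::msg::dds_::String_" for std_msgs/msg/String.
  TypeSupport type(new std_msgs::msg::dds_::String_PubSubType());
  type.register_type(participant);

  // Step 4: ROS 2 prefixes topic names with "rt/", so /chatter becomes rt/chatter.
  Topic * topic = participant->create_topic(
    "rt/chatter", type.get_type_name(), TOPIC_QOS_DEFAULT);

  Publisher * publisher = participant->create_publisher(PUBLISHER_QOS_DEFAULT);
  DataWriter * writer = publisher->create_datawriter(topic, DATAWRITER_QOS_DEFAULT);

  std_msgs::msg::dds_::String_ msg;
  msg.data("hello from plain DDS");
  writer->write(&msg);

  // Cleanup omitted for brevity.
  return 0;
}
```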