We are working on running ROS 2 on an embedded board and have found that ROS 2 consumes a lot of CPU because every ROS node is mapped to a DDS participant. We ran some tests to investigate the issue; the tests and results can be found here: https://github.com/nobleo/ros2_performance.
The roadmap of ROS 2 development mentions “Reconsider 1-to-1 mapping of ROS nodes to DDS participants” https://index.ros.org/doc/ros2/Roadmap/. We would like to see this happen sooner rather than later. We already observe that the mapping leads to CPU usage problems and can constrain people in their freedom to design an architecture for a robotic system. The ROS 2 middleware should allow for a setting where everything can be grouped into a single DDS participant, for people who want to use nodes for modularity at the top level but don’t want the code fragmented at the bottom level. Many use cases exist where one would like to create multiple nodes that all run on the same hardware. This is especially important since intra-process communication does not work effectively at the time of writing this post.
Does anyone face this same kind of problem?
We would like to discuss reconsidering the 1-to-1 mapping of ROS nodes to DDS participants here. We would like the current implementation to change and are willing to contribute to changing it if possible.
Instead of introducing an option, the current idea is to associate the DDS participant with the context created during rmw_init. That would imply that common applications using a single init / context will only use a single DDS participant, even if they are composed of multiple ROS nodes.
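To make the difference concrete, here is a minimal toy model of the two mappings. It is only a sketch: `Participant`, `Node` and `Context` are hypothetical stand-ins invented for illustration and are not the rmw, rcl or DDS APIs.

```python
from dataclasses import dataclass, field
from itertools import count

_next_participant_id = count(1)


@dataclass
class Participant:
    """Toy stand-in for a DDS participant (hypothetical, not a DDS API)."""
    pid: int = field(default_factory=lambda: next(_next_participant_id))


@dataclass
class Node:
    """Toy stand-in for a ROS node that records which participant it uses."""
    name: str
    participant: Participant


class Context:
    """Toy stand-in for the context created at init time."""

    def __init__(self):
        # One participant is created with the context and shared by its nodes.
        self.participant = Participant()

    def create_node(self, name):
        return Node(name, self.participant)


def nodes_one_participant_each(names):
    """1-to-1 mapping: every node gets its own participant."""
    return [Node(name, Participant()) for name in names]


def nodes_sharing_context(names):
    """Context-level mapping: all nodes of one context share one participant."""
    context = Context()
    return [context.create_node(name) for name in names]


def count_participants(nodes):
    """Count the distinct participants used by a list of nodes."""
    return len({node.participant.pid for node in nodes})


names = [f"node_{i}" for i in range(10)]
print(count_participants(nodes_one_participant_each(names)))  # 10
print(count_participants(nodes_sharing_context(names)))       # 1
```

With ten nodes in one process, the first mapping creates ten participants and the second creates one, which is the reduction this proposal is after.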
If it gets implemented in time, it will be available in the next ROS 2 release, which is Eloquent in Nov 2019.
Any help is appreciated. It will likely start with a design article to discuss the side effects of the intended change. For example, the ROS node name is currently used for the DDS participant name. When that mapping goes away, there needs to be a replacement mechanism to communicate the node name.
After the initial ROS Answers post, we’ve also looked into this. In your answer, you mentioned that part of the CPU usage is caused by the executor itself. However, we found that the executor is really the main cause, rather than the DDS participant mapping. Therefore, if we want to lower the overhead (in relation to DDS), we should be looking at the executor as a whole.
Also, as a caution, we should not overlook the overhead that profiling adds, and how much it can really skew the results!
I’m working on a more in-depth analysis of the CPU usage of different parts of the executor. I’ll provide some results sometime next week.
As mentioned in our research, both the SingleThreadedExecutor and the 1-to-1 mapping of nodes to DDS participants appear to contribute to the large CPU overhead. We were planning to open a separate Discourse discussion for the SingleThreadedExecutor optimization. This way the discussions don’t mix and both “problems” can be addressed. The link to the SingleThreadedExecutor discussion will appear on our GitHub page soon.
It depends a bit on the use case. Perf shows roughly a 50/50 split between the two causes in this test (10 nodes, 20 topics/publishers/timers and 200 subscribers). Changing these numbers will also change the CPU usage figures.
Also have a look at https://github.com/ros2/rclcpp/pull/778, which skips DDS entirely for intra-process communication.
Hi, I’m the author of the intra-process communication PR mentioned by @scg.
First of all, thank you for showing your results.
Using 1 participant per process (or context) is definitely an interesting idea.
Especially with Fast-RTPS since it would also reduce the memory usage a lot.
Fast-RTPS does not implement a shared-memory transport yet, but it recognizes “local publications”, i.e. messages where the publisher and the subscription are in the same participant. In this case the message is not sent over the network, but passed directly to the subscription.
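As a rough illustration of why sharing a participant helps here, the following toy model delivers a message directly when the subscription lives in the publisher's own participant, and counts a network send otherwise. The `Participant` class and the `run` helper are hypothetical and only mimic the behavior described above; they are not the Fast-RTPS API.

```python
class Participant:
    """Toy stand-in for a DDS participant (hypothetical, not the Fast-RTPS API)."""

    def __init__(self):
        self.subscriptions = {}  # topic -> list of callbacks
        self.network_sends = 0   # messages that had to leave this participant

    def subscribe(self, topic, callback):
        self.subscriptions.setdefault(topic, []).append(callback)

    def publish(self, topic, message, remote_participants=()):
        # Local publication: subscriptions in the same participant are
        # called directly, without touching the network.
        for callback in self.subscriptions.get(topic, []):
            callback(message)
        # Subscriptions in other participants need one send per participant.
        for remote in remote_participants:
            callbacks = remote.subscriptions.get(topic, [])
            if callbacks:
                self.network_sends += 1
                for callback in callbacks:
                    callback(message)


def run(shared):
    """Publish one message with the subscriber in the same or another participant."""
    received = []
    publisher_side = Participant()
    subscriber_side = publisher_side if shared else Participant()
    subscriber_side.subscribe("chatter", received.append)
    remotes = () if shared else (subscriber_side,)
    publisher_side.publish("chatter", "hello", remotes)
    return received, publisher_side.network_sends


print(run(shared=True))   # (['hello'], 0)
print(run(shared=False))  # (['hello'], 1)
```

In both cases the message arrives, but only the shared-participant case avoids the send, which is the effect a single participant per process would give to nodes in that process.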
I ran your performance tests together with the new intra-process communication.
| Test | CPU |
|---|---|
| rosonenode | 12% |
| rtps | 8% |
This still does not match the results of Fast-RTPS alone for this particular example.
Note that the new Intra-process implementation adds an additional entity to the waitset of the nodes for each subscription (possibly slowing down the SingleThreadedExecutor).
Note that other RMW implementations also make it possible to reduce CPU usage, for example CycloneDDS (https://github.com/eboasson/cyclonedds), where a form of intra-process communication is already implemented.
The performance of CycloneDDS is slightly worse than that of the rclcpp PR, since with the PR it is possible to easily skip serialization and to save some copies by knowing all the subscriptions in advance.
Not trying to debate the performance gains of this approach, but I think it’s worth pointing out that SROS 2 (and indeed DDS-Security in general) currently only supports security at the domain participant level. Using the same participant for multiple nodes will make security very difficult, as all nodes will have effectively the same identity and thus the same access control.
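To illustrate the concern, here is a small sketch of what participant-level access control implies. It assumes, purely for illustration, that a shared participant must be granted the union of what its nodes need; the node names, participant names and permission strings are hypothetical and do not come from SROS 2 or DDS-Security.

```python
def effective_permissions(node_to_participant, node_needs):
    """Return the permissions each node effectively holds.

    Permissions are granted per participant, so every node inherits
    everything granted to the participant it runs in.
    """
    granted = {}
    for node, participant in node_to_participant.items():
        granted.setdefault(participant, set()).update(node_needs[node])
    return {
        node: granted[participant]
        for node, participant in node_to_participant.items()
    }


# Hypothetical nodes and the access each one actually needs.
node_needs = {
    "camera_driver": {"publish:/image"},
    "motor_controller": {"subscribe:/cmd_vel"},
}

separate = effective_permissions(
    {"camera_driver": "participant_a", "motor_controller": "participant_b"},
    node_needs,
)
shared = effective_permissions(
    {"camera_driver": "participant_a", "motor_controller": "participant_a"},
    node_needs,
)

print(sorted(separate["camera_driver"]))  # ['publish:/image']
print(sorted(shared["camera_driver"]))    # ['publish:/image', 'subscribe:/cmd_vel']
```

With one participant per node, each node holds only what it needs. With a shared participant, both nodes end up with the same, wider set, so per-node access control is lost.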
Hi @tomoyafujita,
Thank you for your comment. Should I raise an issue for this on the rmw GitHub page (https://github.com/ros2/rmw/issues)? Or what is the usual way of registering an issue? And can I link to our GitHub or Discourse page in the issue?
@ivanpauno what is your expected timeline for these changes?
For someone who is new to this process (getting changes implemented into the core of ROS2) it is hard to judge how long implementing something like this could/would take.
I’m also very interested in getting a high-level description of the entire process, if this is possible.
The developer guide https://index.ros.org//doc/ros2/Contributing/Developer-Guide is written more with “normal” packages in mind. I don’t think a major overhaul like this (or, for instance, other changes to rclcpp, rcl and rmw that can have far-reaching impact) is simply handled by creating multiple disjointed pull requests.
From watching from the sidelines what I’ve seen so far is that:
1. A discussion is started on Discourse / an issue is raised on the GitHub page of the package
2. A design document is written together with community members (and members of the TSC)
3. The document is reviewed by (community members and) members of the TSC
4. ?
This leads me to the following questions that maybe you or another member of the community could answer:
After the design document is written:
- How are the next steps decided?
- Who does the actual implementation, and how is this decided?
- How does one keep track of all the activities in multiple layers between multiple people?
- Do multiple interested parties just respond to the issue/design document and figure something out from there? Is there a certain structure to this? Who is “responsible” for the final outcome?
Thanks in advance to anyone that can give me some clarity on the process.
Also thanks for all the hard work everyone has been putting in so far!