Reducing ROS 2 CPU overhead by simplifying the ROS 2 layers

alsora · April 24, 2020, 4:52pm

Hi all,

With the next Foxy release, the performance of ROS 2 applications will get a nice boost thanks to new features such as:

the StaticSingleThreadedExecutor https://github.com/ros2/rclcpp/pull/1034
the new Waitset class implementation https://github.com/ros2/rclcpp/pull/1047
the 1 participant per context re-mapping https://github.com/ros2/design/pull/250

However, there is still much work that can be done, especially to reduce the CPU usage of the application.

As already highlighted in the thread "SingleThreadedExecutor creates a high CPU overhead in ROS 2", most of the overhead appears to be related to the use of executors and waitsets.

We can identify some major contributors to this overhead:

Modifying a waitset is an expensive operation. Currently this happens multiple times in every iteration of the executor, even though most ROS 2 systems are largely static.
The handling of ROS 2 timers can be greatly improved. Timers are currently managed by the rcl layer, where the full list of timers associated with an executor is checked twice at each iteration.
The presence of so many layers (rclcpp, rcl, rmw_xxx) between the application and the underlying middleware makes the problem more complex, especially because most of the time these layers are not simply forwarding data but performing non-trivial operations.

As of today, running the iRobot benchmark application (1 process, 20 nodes) on a Raspberry Pi platform uses approximately 20% CPU.

Here I want to present an approach we developed that cuts the CPU usage from 20% to 6%.

It’s based on the idea that in a ROS 2 system we have 2 types of events: intra-process and inter-process events.

Timers and Intra-process messages are intra-process events
Inter-process messages are inter-process events

Currently both types of events are influenced by the whole ROS 2 stack and by the underlying middleware.
Even if you use intra-process communication, the synchronization primitives are managed by the WaitSet and are passed from the application down to the middleware.
Intra-process events are also heavily affected by spurious wake-ups: although the synchronization primitive lives in the middleware, the predicate that decides whether the system actually has to wake up is checked in the rclcpp layer, and, as we have seen, going through all the layers has several issues.

In order to show how much the performance can be improved by investigating and tackling the overhead that occurs in all these layers, we decided to create a new executor, named the RclcppExecutor.
The idea of this executor is that it only handles intra-process events and that it does that entirely within the rclcpp layer, without sending anything down the stack.

Proofs of concept

We built several prototypes for this executor, in part to highlight the overhead caused by each of the individual problems affecting the ROS 2 stack.

Instead of adding intra-process subscriptions to an executor, we created a separate thread for each of them. These threads are extremely simple and they only monitor when a new message is pushed into the intra-process subscription buffer, thus triggering the associated callback.
Note that we still used the ROS 2 synchronization primitives rcl_wait and rcl_guard_condition_t.
This reduced the CPU usage to 14%
Implementation: Execute intra-process subscription callbacks in a separate thread rat… · alsora/rclcpp@ea3f97c · GitHub
We substituted the rcl_wait and rcl_guard_condition_t used by these new threads with instances of std::condition_variable.
This reduced the CPU usage to 11%
Implementation: use std::condition_variable instead of rcl_guard_condition_t · alsora/rclcpp@609dd6e · GitHub
We also moved timers out of the executor and into separate threads. Each thread simply slept for the required amount of time, triggered the timer callback, and then went back to sleep.
All of this was implemented using std::chrono.
This reduced the CPU usage to 9%
Implementation: Create WallTimer class · irobot-ros/ros2-performance@16e30bd · GitHub
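
To make the second prototype more concrete, here is a minimal sketch of a per-subscription thread driven by std::condition_variable. It is not the code from the linked commits: SubscriptionThread, push and deliver are hypothetical names, and the buffer is a plain std::queue rather than the rclcpp intra-process buffer.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for an intra-process subscription: a buffer plus a
// dedicated thread that runs the callback whenever a message is pushed.
template <typename MsgT>
class SubscriptionThread {
public:
  explicit SubscriptionThread(std::function<void(const MsgT &)> callback)
  : callback_(std::move(callback)), worker_([this] { run(); }) {}

  ~SubscriptionThread() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stop_ = true;
    }
    cv_.notify_one();
    worker_.join();
  }

  // Called by the publisher side: store the message and wake the worker.
  void push(MsgT msg) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      buffer_.push(std::move(msg));
    }
    cv_.notify_one();
  }

private:
  void run() {
    std::unique_lock<std::mutex> lock(mutex_);
    for (;;) {
      // The predicate sits next to the primitive, so a spurious wake-up
      // costs one cheap check instead of a trip through several layers.
      cv_.wait(lock, [this] { return stop_ || !buffer_.empty(); });
      if (buffer_.empty()) {
        return;  // stop requested and nothing left to deliver
      }
      MsgT msg = std::move(buffer_.front());
      buffer_.pop();
      lock.unlock();
      callback_(msg);  // run user code without holding the lock
      lock.lock();
    }
  }

  std::function<void(const MsgT &)> callback_;
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<MsgT> buffer_;
  bool stop_ = false;
  std::thread worker_;  // declared last so the members above exist first
};

// Pushes n integers through the subscription and returns what the callback saw.
std::vector<int> deliver(int n) {
  std::vector<int> received;
  {
    SubscriptionThread<int> sub([&received](const int & v) { received.push_back(v); });
    for (int i = 0; i < n; ++i) {
      sub.push(i);
    }
  }  // the destructor drains the buffer and joins the worker
  return received;
}
```

Calling deliver(5) pushes five values and returns them in the order the callback received them. The sketch uses one condition variable per subscription, as in the prototypes above.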

The `RclcppExecutor`

We then decided to wrap up everything we had learnt into an executor.
This is a new single-threaded executor with the following characteristics:

It uses std::condition_variable and std::mutex instead of the ROS 2 synchronization primitives.
Instead of 1 condition variable per intra-process subscription, it uses a single condition variable per executor.
It uses a heap-based priority queue to reduce the overhead of inspecting the timers.

This reduced the CPU usage to 6%
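
As a rough illustration of the timer part, the sketch below keeps timers in a std::priority_queue ordered by next deadline, so only the top element has to be inspected to know how long to sleep. This is an assumption about how such a queue could look, not the RclcppExecutor code: Timer, TimerHeap and simulate are hypothetical names. A real executor would presumably pass next_deadline() to std::condition_variable::wait_until on its single condition variable; here time is simulated so the example stays deterministic.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical timer record; not the rcl or rclcpp timer type.
struct Timer {
  Clock::time_point next_deadline;
  Clock::duration period;
  std::function<void()> callback;
};

// Orders timers so that the earliest deadline sits at the top of the heap.
struct LaterDeadline {
  bool operator()(const Timer & a, const Timer & b) const {
    return a.next_deadline > b.next_deadline;
  }
};

class TimerHeap {
public:
  void add(Clock::time_point start, Clock::duration period, std::function<void()> cb) {
    assert(period > Clock::duration::zero());  // a zero period would never advance
    heap_.push(Timer{start + period, period, std::move(cb)});
  }

  bool empty() const { return heap_.empty(); }

  // Only the top element is inspected. Precondition: the heap is not empty.
  Clock::time_point next_deadline() const { return heap_.top().next_deadline; }

  // Runs every timer that is due at `now`, reschedules it, and returns the
  // number of callbacks executed. A timer that missed several periods fires
  // once per missed period.
  int fire_due(Clock::time_point now) {
    int fired = 0;
    while (!heap_.empty() && heap_.top().next_deadline <= now) {
      Timer t = heap_.top();
      heap_.pop();
      t.callback();
      t.next_deadline += t.period;
      heap_.push(std::move(t));
      ++fired;
    }
    return fired;
  }

private:
  std::priority_queue<Timer, std::vector<Timer>, LaterDeadline> heap_;
};

// Simulates 100 ms for a 10 ms and a 25 ms timer; returns {fast, slow} counts.
std::pair<int, int> simulate() {
  using namespace std::chrono_literals;
  TimerHeap timers;
  int fast = 0;
  int slow = 0;
  const Clock::time_point t0{};
  timers.add(t0, 10ms, [&fast] { ++fast; });
  timers.add(t0, 25ms, [&slow] { ++slow; });
  const Clock::time_point end = t0 + 100ms;
  while (timers.next_deadline() <= end) {
    // Jump straight to the earliest deadline instead of really sleeping.
    timers.fire_due(timers.next_deadline());
  }
  return {fast, slow};
}
```

With this layout, finding the next timer to fire is a constant-time look at the top of the heap, and rescheduling a fired timer is logarithmic in the number of timers, instead of scanning the full timer list on every iteration.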

You can find the implementation here

Conclusions

What we have done shows a very efficient way of implementing single-process ROS 2 applications.
This can be implemented either as a separate executor or integrated into an existing one, to provide a multi-threaded executor that uses 1 thread for intra-process events and 1 thread for inter-process events.

We decided to tackle the intra-process case because it’s simpler and can lead to great results with almost no architectural change, since it can coexist with existing solutions.

However, similar improvements must also be applied to the inter-process case.
Possible solutions could consist of making the ROS 2 application directly use the waitset provided by the DDS middleware and changing the ROS layers to just forward this structure with minimal overhead. This can be implemented in a middleware-generic way by taking advantage of the DDS C++ APIs, for example: About the ISO/IEC C++ 2003 Language DDS PSM Specification Version 1.0

At the same time, the ROS 2 application should be considered generally static, to reduce the overhead on the system after all the nodes have been discovered.

At iRobot, we are currently investigating prototypes and approaches for improving this scenario as well, and we will keep you posted.

Let’s keep improving ROS 2.

ruffsl · April 24, 2020, 5:33pm

Could these improvements be generalized beyond just C++, so all client library languages can benefit?

tomoyafujita · April 25, 2020, 8:59pm

awesome, thank you so much for your effort!

that cuts the CPU usage from 20% to 6%.

do you have some kinda final target on this with reason or specific use case?

alsora · April 27, 2020, 10:12am

Thank you everyone for the interest shown.

do you have some kinda final target on this with reason or specific use case?

Our objective is to make ROS 2 as lightweight as possible, so that people working with resource-constrained embedded platforms are in a position to consider a transition to ROS 2.
Our benchmark application makes it possible to measure the performance of an arbitrary ROS 2 system that only does message passing, with 1 or multiple processes.
1 year ago, the measured performance for a 20-node system on a Raspberry Pi 3 was around 35-40% CPU usage and more than 100 MB of RAM.

When working with platforms with that little computational power, you are already working hard to squeeze everything in, so the overhead that would be introduced by a transition to ROS 2 becomes unsustainable.

In terms of raw numbers, our rough targets were a 20-node application running in 15% CPU and 20 MB on a Raspberry Pi 3 (quad core), and a 10-node application running in 10% CPU and 10 MB on a Raspberry Pi 1 (single core).

Since we joined this community, we are proud to have contributed (with different degrees of involvement) to most of the relevant performance updates.
We consider that with this last update, our performance targets would be met for a single process application.
However, there is still much work to do in the multi-process scenario.

Could these improvements be generalized beyond just C++, so all client library languages can benefit?

These latest improvements have been designed to be applied to the C++ layer, where intra-process communication is available.
This allowed us to quickly put together a proposal that is simple enough for everyone to try.

However, the main goal was also to highlight the bottlenecks of the current architecture, especially because all the problems mentioned in the first post are also present with inter-process communication.
If such a simple solution yields such noticeable improvements, I think that it’s worth discussing what can be changed in the ROS 2 layers.

I can imagine the following roadmap for merging the above features while trying to be as generic as possible:

Update the intra-process subscription to work with a single guard condition per executor.
Implement a heap-based priority queue for timers in the rcl layer so that it can be used by every executor (this would also allow simplifying it).
Implement the rclcpp executor: this will require an efficient way to access the condition variable without using the waitset, and would use the queue implementation from rcl.

Then the focus can be moved to the rcl and rmw layers, where possibly the rcl and rmw waitsets should be removed to allow a direct access to the DDS waitset (or an extremely thin abstraction over it).

This second phase is still very much a work in progress for us, as we don’t have a prototype to measure the improvements yet. However, our idea is that it will provide a big benefit to all the client libraries.

We would like to have feedback and proposals from the ROS 2 community while working on it, in particular from the DDS vendors, as this is such an important change.
@wjwwood @Dejan_Pangercic @joespeed @Jaime_Martin_Losa

emersonknapp · April 27, 2020, 7:37pm

Something I’d like to bring up with the TSC is whether we can introduce C++ implementations under the rcl API - I understand that much of the advanced feature development benefits from the higher level language and its standard library, and also understand that we want a C API for easy integration to other language clients (e.g. Python, Rust, Java, etc) - what I’m not sure about right now is why we don’t put these extensions under rcl, implemented in C++, but exposing a C API for use. That, to me, seems like a great way to generalize these features to benefit everybody.

Loving this work! But I am very concerned about ROS 2 duplicating the ros_comm situation and completely reimplementing large amounts of functionality in different language clients, or even worse, making rclpy basically unusable in serious projects because it lacks too many features.

ruffsl · April 28, 2020, 8:12am

what I’m not sure about right now is why we don’t put these extensions under rcl, implemented in C++, but exposing a C API for use. That, to me, seems like a great way to generalize these features to benefit everybody.

I’m a bit biased here towards security and static analysis, but I wouldn’t like to see a reliance on C++ creeping into rcl. When I worked in industrial communications at TI, I rarely encountered good C++ support for the low-power embedded devices and vendor compilers we used. I’d rather stick with pure C a bit longer, making an eventual migration to memory-safe systems languages that support concurrency easier in the future.

emersonknapp · April 30, 2020, 9:26pm

Maybe an alternative would be to build rclpy as a pybind layer on top of rclcpp then? This doesn’t enable all other languages, but at least keeps the two top ones in much better feature parity.

Edit: more thoughts - I’m not super inclined to suggest that we rebuild the intraprocess stuff in C - I think you’re suggesting Go or Rust, but I don’t see that happening anytime soon (years, if even agreed as a thing to do?)

I understand that, in theory, pure-C support is technically possible given the layer architecture, but are there any rmw implementations out there that aren’t written in C++? I guess the counterpoint is that we want to leave the door open for one.

tomoyafujita · May 1, 2020, 12:33am

but are there any rmw implementations out there that aren’t written in C++?

cyclonedds is a pure C implementation.

ruffsl · May 1, 2020, 12:57am

Neither of those languages addresses the embedded device space well, however.

GC languages like Go don’t seem as good a fit for high-performance or real-time systems though:

But keeping with C now will make interoperability with systems languages like Ada and Rust a lot easier in the future, as well as with others that can be used to formally prove correctness, like OCaml or Haskell:

https://rosettacode.org/wiki/Call_a_foreign-language_function

https://msrc-blog.microsoft.com/2019/07/16/a-proactive-approach-to-more-secure-code/

I’m not so sure such use cases would be as simple to support if rcl became reliant on a C++ standard.

Yep, there are a few middleware vendors for ROS 2 whose implementations are written in C, RTI Connext being one of them. In any case, from the Language Support section of the design docs, it seems C++ implementations were meant to be temporary measures, given the complexity/bloat from the C++ standard.

emersonknapp · May 5, 2020, 4:15pm

Thanks for a thorough response!

But keeping with C now will make interoperability with systems languages like Ada and Rust a lot easier in the future, as well as with others that can be used to formally prove correctness, like OCaml or Haskell:

I think this misses the point. My suggestion was to put C++ implementation of certain feature sets underneath a C API, not to move to a C++ API. These could even be extension libraries that are not part of core rcl, and could be replaced over time by different implementations without breaking the API (but these are implementation details). The point being that integration of any client language that can speak to C would be unaffected, because the API would be pure C. No interoperability problems.
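
A minimal sketch of that idea, assuming a small bounded buffer as the feature being shared: the declarations inside extern "C" are what a C header would expose, while the implementation uses the C++ standard library. All demo_ring_buffer_* names are hypothetical and are not part of rcl.

```cpp
#include <stddef.h>
#include <deque>

// ---- What would live in a public C header (all names are hypothetical) ----
extern "C" {
typedef struct demo_ring_buffer_t demo_ring_buffer_t;  // opaque handle

demo_ring_buffer_t * demo_ring_buffer_create(size_t capacity);
void demo_ring_buffer_destroy(demo_ring_buffer_t * buffer);
int demo_ring_buffer_push(demo_ring_buffer_t * buffer, int value);
int demo_ring_buffer_pop(demo_ring_buffer_t * buffer, int * value_out);
size_t demo_ring_buffer_size(const demo_ring_buffer_t * buffer);
}

// ---- C++ implementation hidden behind the C API ----
struct demo_ring_buffer_t {
  size_t capacity;
  std::deque<int> items;
};

extern "C" demo_ring_buffer_t * demo_ring_buffer_create(size_t capacity) {
  if (capacity == 0) {
    return nullptr;
  }
  try {
    return new demo_ring_buffer_t{capacity, {}};
  } catch (...) {
    return nullptr;  // exceptions must not cross the C boundary
  }
}

extern "C" void demo_ring_buffer_destroy(demo_ring_buffer_t * buffer) {
  delete buffer;  // deleting a null pointer is a no-op
}

// Returns 0 on success and -1 on failure.
extern "C" int demo_ring_buffer_push(demo_ring_buffer_t * buffer, int value) {
  if (buffer == nullptr) {
    return -1;
  }
  try {
    if (buffer->items.size() == buffer->capacity) {
      buffer->items.pop_front();  // full: drop the oldest entry
    }
    buffer->items.push_back(value);
    return 0;
  } catch (...) {
    return -1;
  }
}

// Returns 0 and writes the oldest value on success, -1 if nothing is available.
extern "C" int demo_ring_buffer_pop(demo_ring_buffer_t * buffer, int * value_out) {
  if (buffer == nullptr || value_out == nullptr || buffer->items.empty()) {
    return -1;
  }
  *value_out = buffer->items.front();
  buffer->items.pop_front();
  return 0;
}

extern "C" size_t demo_ring_buffer_size(const demo_ring_buffer_t * buffer) {
  return buffer != nullptr ? buffer->items.size() : 0;
}
```

A client written in C, or in any language with a C foreign function interface, only sees the opaque handle and the five functions. The library itself still needs a C++ compiler and runtime on the target, which is the embedded-platform concern raised earlier in the thread.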

Assuming that we don’t want to do the above - because of lacking support for C++ stdlibs on embedded platforms, not because of client language interoperability concerns - the question to me becomes: when do we, as a ROS 2 development community, stop accepting major feature sets into the C++ language client? It seems that we’ve begun to set a precedent and are slowly closing the door on one of the big promises of the ROS 2 project by letting rclcpp diverge so far from the other language clients, having it provide functionality well outside the scope of the rcl base API.

The follow-up is, what do we do instead? Do we reimplement the rclcpp-specific features in C - or in a different language and expose a C API - so that rclcpp, rclpy et al. can use those features? Intra-Process and Composition are top of mind for me.

I’d be interested to hear from @wjwwood @dirk-thomas @tfoote high level on this topic as well - though probably sometime next month would be an easier time to start such a conversation

Satco · October 12, 2021, 12:19pm

@emersonknapp Only through this post have I understood that rcl is the core ROS library and not rclcpp. I always thought the core was written in C++ and Python came as an extra, but it seems like both C++ and Python are extras to C. Has there been any discussion regarding these topics in the meantime?

Rayman · October 15, 2021, 2:47pm

I think there were 2 main reasons why the rcl is written in C:

Maximum portability: every embedded system has at least a C compiler
It’s easier to create language bindings to a C API than to a C++ one (name mangling etc.)
