With the next Foxy release, the performance of ROS 2 applications will get a nice boost thanks to new features such as:
- the new `WaitSet` class implementation https://github.com/ros2/rclcpp/pull/1047
- the 1-participant-per-context remapping https://github.com/ros2/design/pull/250
However, there is still much work that can be done, especially to reduce the CPU usage of the application.
As already highlighted in the "SingleThreadedExecutor creates a high CPU overhead in ROS 2" discussion, most of the overhead appears to be related to the use of executors and waitsets.
We can identify some major contributors to this overhead:
- Modifying a waitset is an expensive operation. Currently this happens multiple times on every iteration of the executor, even though the majority of ROS 2 systems are mostly static.
- The use of ROS 2 timers can be greatly improved. Timers are currently managed by the `rcl` layer, where at each iteration the full list of timers associated with an executor is checked twice.
- The presence of so many layers (e.g. the `rmw_xxx` ones) between the application and the underlying middleware makes the problem more complex, especially because most of the time these layers are not simply forwarding data, but rather performing non-trivial operations.
As of today, running the iRobot benchmark application (1 process, 20 nodes) on a Raspberry Pi platform results in approximately 20% CPU usage.
Here I want to present an approach that we developed, which cuts the CPU usage from 20% to 6%.
It’s based on the idea that in a ROS 2 system we have 2 types of events: intra-process and inter-process events.
- Timers and Intra-process messages are intra-process events
- Inter-process messages are inter-process events
Currently both types of events are influenced by the whole ROS 2 stack and by the underlying middleware.
Even if you use intra-process communication, the synchronization primitives are managed by the `WaitSet` and are sent from the application down to the middleware.
Intra-process events are also highly impacted by spurious wakeups: even though the synchronization primitive lives in the middleware, the predicate that has to be checked to decide whether the system should wake up lives in the `rclcpp` layer, and, as we have seen, going through all the layers has several issues.
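To illustrate why co-locating the predicate with the synchronization primitive matters: when both live in the same place, spurious wakeups can be filtered right at the wait site, without any cross-layer round trip. A minimal sketch in standard C++ (the names here are illustrative, not the actual rclcpp code):

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical intra-process buffer: the predicate (non-empty queue) sits
// right next to the condition variable, so spurious wakeups never reach
// user code and never cross a layer boundary.
class MessageBuffer {
public:
  void push(int msg) {
    { std::lock_guard<std::mutex> lk(mutex_); queue_.push(msg); }
    cv_.notify_one();
  }
  int pop() {
    std::unique_lock<std::mutex> lk(mutex_);
    // The predicate is re-checked on every wakeup, spurious or not.
    cv_.wait(lk, [this] { return !queue_.empty(); });
    int msg = queue_.front();
    queue_.pop();
    return msg;
  }
private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<int> queue_;
};
```

In the current stack, by contrast, the wakeup happens in the middleware and the equivalent of that predicate check only runs after control has climbed back up to `rclcpp`.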
In order to show how much the performance can be improved by investigating and tackling the overhead that occurs in all these layers, we decided to create a new executor.
The idea of this executor is that it only handles intra-process events, and that it does so entirely within the `rclcpp` layer, without sending anything down the stack.
Proofs of concept
We built several prototypes for this executor, also with the purpose of highlighting the overhead caused by each of the individual problems affecting the ROS 2 stack.
- Instead of adding intra-process subscriptions to an executor, we created a separate thread for each of them. These threads are extremely simple: they only monitor when a new message is pushed into the intra-process subscription buffer, and then trigger the associated callback. Note that we still used the ROS 2 synchronization primitives. This reduced the CPU usage to 14%.
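This first prototype can be sketched as follows. Everything here is an illustrative stand-in, not the actual prototype code: in particular, the condition variable below stands in for the ROS 2 guard condition that the real prototype still used.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Illustrative stand-in for an intra-process subscription buffer plus the
// dedicated monitoring thread used in the first proof of concept.
class SubscriptionWorker {
public:
  explicit SubscriptionWorker(std::function<void(int)> callback)
      : callback_(std::move(callback)),
        thread_([this] { run(); }) {}

  ~SubscriptionWorker() {
    { std::lock_guard<std::mutex> lk(mutex_); stop_ = true; }
    cv_.notify_one();
    thread_.join();  // the worker drains remaining messages before exiting
  }

  // Called by the intra-process publisher when a message is pushed.
  void on_message(int msg) {
    { std::lock_guard<std::mutex> lk(mutex_); buffer_.push(msg); }
    cv_.notify_one();  // stand-in for triggering the guard condition
  }

private:
  void run() {
    std::unique_lock<std::mutex> lk(mutex_);
    while (true) {
      cv_.wait(lk, [this] { return stop_ || !buffer_.empty(); });
      if (stop_ && buffer_.empty()) return;
      int msg = buffer_.front();
      buffer_.pop();
      lk.unlock();
      callback_(msg);  // run the user callback outside the lock
      lk.lock();
    }
  }

  std::function<void(int)> callback_;
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<int> buffer_;
  bool stop_ = false;
  std::thread thread_;  // declared last so it starts after the other members
};
```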
- We substituted the `rcl_guard_condition_t` used by these new threads with plain C++ synchronization primitives. This reduced the CPU usage to 11%.
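A minimal user-space equivalent of a guard condition can be built from a mutex, a condition variable, and a flag. This sketch is an assumption about what such a replacement looks like, based on the final executor's use of standard C++ primitives, and all names are hypothetical:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical user-space replacement for rcl_guard_condition_t:
// trigger() wakes a waiter without any call into the rcl/rmw/DDS layers.
class LocalGuardCondition {
public:
  void trigger() {
    { std::lock_guard<std::mutex> lk(mutex_); triggered_ = true; }
    cv_.notify_one();
  }
  void wait() {
    std::unique_lock<std::mutex> lk(mutex_);
    cv_.wait(lk, [this] { return triggered_; });
    triggered_ = false;  // reset after a successful wait
  }
private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool triggered_ = false;
};
```

The key difference from the real guard condition is not the primitive itself but where it lives: triggering and waiting never leave the process-local layer.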
- We also moved timers out of the executor and into separate threads. Each thread simply sleeps for the required amount of time, triggers the timer callback, and then goes back to sleep. All of this was implemented using standard C++ facilities. This reduced the CPU usage to 9%.
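The timer prototype amounts to one thread per timer that sleeps for the period and fires the callback. A self-contained sketch, using only standard C++ and illustrative names:

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Illustrative one-thread-per-timer prototype: sleep for the period,
// fire the callback, go back to sleep, until asked to stop.
class TimerThread {
public:
  TimerThread(std::chrono::milliseconds period, std::function<void()> callback)
      : period_(period),
        callback_(std::move(callback)),
        thread_([this] { run(); }) {}

  ~TimerThread() {
    stop_ = true;
    thread_.join();  // may block for up to one period
  }

private:
  void run() {
    while (!stop_) {
      std::this_thread::sleep_for(period_);
      if (!stop_) callback_();
    }
  }

  std::chrono::milliseconds period_;
  std::function<void()> callback_;
  std::atomic<bool> stop_{false};
  std::thread thread_;
};
```

This avoids the twice-per-iteration timer scan in `rcl`, at the cost of one thread per timer, which is what the final executor's priority queue later removes.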
Then we decided to wrap up everything we had learned into an executor.
This is a new single thread executor with the following characteristics:
- It uses `std::mutex` instead of the ROS 2 synchronization primitives.
- Instead of 1 condition variable per intra-process subscription, it uses a single condition variable per executor.
- It uses a priority queue (a heap ordered by expiration time) to reduce the overhead of inspecting timers.
This reduced the CPU usage to 6%.
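The characteristics above can be sketched together in standard C++. This is a simplified illustration, not the actual implementation, and all names are hypothetical: one mutex and one condition variable for the whole executor, an event queue for intra-process messages, and a min-heap of timer deadlines so that only the earliest timer is inspected per iteration.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <vector>

using Clock = std::chrono::steady_clock;

struct Timer {
  Clock::time_point deadline;
  std::chrono::milliseconds period;
  std::function<void()> callback;
};

// Min-heap comparator: the timer with the earliest deadline sits on top,
// so the executor inspects one element instead of scanning the full list.
struct LaterDeadline {
  bool operator()(const Timer& a, const Timer& b) const {
    return a.deadline > b.deadline;
  }
};

class SingleThreadedEventsExecutor {
public:
  // Called before spinning, matching the mostly-static-system assumption.
  void add_timer(std::chrono::milliseconds period, std::function<void()> cb) {
    timers_.push({Clock::now() + period, period, std::move(cb)});
  }

  // Called from any thread (e.g. an intra-process publisher).
  void push_event(std::function<void()> event) {
    { std::lock_guard<std::mutex> lk(mutex_); events_.push(std::move(event)); }
    cv_.notify_one();  // the single condition variable of the executor
  }

  // Process events and timers until the given point in time.
  void spin_until(Clock::time_point end) {
    std::unique_lock<std::mutex> lk(mutex_);
    while (Clock::now() < end) {
      // Sleep until the next timer deadline or until an event arrives.
      Clock::time_point wakeup = end;
      if (!timers_.empty() && timers_.top().deadline < wakeup)
        wakeup = timers_.top().deadline;
      cv_.wait_until(lk, wakeup, [this] { return !events_.empty(); });

      while (!events_.empty()) {
        auto event = std::move(events_.front());
        events_.pop();
        lk.unlock(); event(); lk.lock();  // callbacks run outside the lock
      }
      while (!timers_.empty() && timers_.top().deadline <= Clock::now()) {
        Timer t = timers_.top();
        timers_.pop();
        lk.unlock(); t.callback(); lk.lock();
        t.deadline += t.period;
        timers_.push(t);  // re-arm with the next deadline
      }
    }
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> events_;
  std::priority_queue<Timer, std::vector<Timer>, LaterDeadline> timers_;
};
```

Note how nothing here touches a waitset: waking up for a message costs one `notify_one`, and checking timers costs one `top()` on the heap instead of two scans of the full timer list.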
You can find the implementation here.
What we have done shows a very efficient way of implementing single-process ROS 2 applications.
This can be implemented either as a separate executor or integrated into an existing one, to provide a multi-threaded executor that uses 1 thread for intra-process events and 1 thread for inter-process events.
We decided to tackle the intra-process case first, as it is simpler and can lead to great results with almost no architectural change, since it can coexist with existing solutions.
However, similar improvements should also be applied to the inter-process case.
A possible solution could consist in having the ROS 2 application directly use the waitset provided by the DDS middleware, and changing the ROS layers to simply forward this structure with minimal overhead. This could be implemented in a middleware-agnostic way by taking advantage of the DDS C++ APIs, for example https://www.omg.org/spec/DDS-PSM-Cxx/
At the same time, a ROS 2 application should generally be considered static, to reduce the overhead on the system after all the nodes have been discovered.
At iRobot, we are currently investigating prototypes and approaches for improving this scenario as well, and we will keep you posted.
Let’s keep improving ROS 2.