Sorry for the delay! Here are the results of the investigation done by @iluetkeb and me.
## Summary
We could replicate the earlier results showing that the Executor consumes a lot of CPU. Within that, we can distinguish two cases:
- when there are few or no messages (e.g., for a timer-driven node), the `wait_for_work` method causes the majority of the overhead, with 70% of its time spent in `rclcpp` and only 30% (excluding waiting) spent in the RMW layer and below. We determined this using the “nopub”, pure-timer benchmark.
- when there are many messages, the majority of the CPU usage (up to about 30% of one CPU core in our tests) is caused by the `get_next_ready_executable` function. This is pure `rclcpp`. We determined this using scg’s “ros” benchmark, which sends 10000 small messages per second.
## Background
Compared to earlier work with similar results, we took care to minimize overhead and only count time spent actually on the CPU. Therefore, we consider not just the qualitative result, but also the absolute numbers to be trustworthy.
This has been non-trivial, because the executor calls very many, very short functions (mainly to do with `weak_ptr`s). This causes problems both for traditional profiling (which adds a lot of overhead) and for sampling-based profiling (which may miss these short functions entirely). Just to give an idea: initialising the nodes took at least a good 10 seconds when built with `gcc -finstrument-functions`! Without profiling, it takes ~100 ms.
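To illustrate why function-entry instrumentation gets so expensive here, the sketch below shows the mechanism behind `-finstrument-functions`: the compiler emits a call into two hooks around every function, so the cost is paid per call. The hook bodies are our own minimal stand-ins, not any particular profiler's.

```cpp
// Build with: g++ -finstrument-functions example.cpp
// Minimal sketch of what -finstrument-functions does: the compiler calls these two
// hooks around every (non-excluded) function, so the profiling cost is paid per
// call -- ruinous when the code consists of many very short functions.
#include <cstdio>

extern "C" {
// The hooks themselves must not be instrumented, or they would recurse.
void __cyg_profile_func_enter(void * fn, void * call_site)
  __attribute__((no_instrument_function));
void __cyg_profile_func_exit(void * fn, void * call_site)
  __attribute__((no_instrument_function));

void __cyg_profile_func_enter(void * fn, void *)
{
  std::fprintf(stderr, "enter %p\n", fn);  // a real profiler would record a timestamp here
}
void __cyg_profile_func_exit(void * fn, void *)
{
  std::fprintf(stderr, "exit  %p\n", fn);
}
}  // extern "C"

static int tiny_helper(int x) { return x + 1; }  // every call to this triggers two hook calls

int main()
{
  int sum = 0;
  for (int i = 0; i < 3; ++i) {
    sum += tiny_helper(i);
  }
  return sum == 6 ? 0 : 1;
}
```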
To get trustworthy numbers despite this, we 1) explicitly instrument only the relevant high-level functions and 2) capture scheduling events. This allows us to sum CPU time only over the periods when the thread is actually executing on the CPU.
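As a small standalone illustration of that accounting (this is not the actual analysis code, which runs offline on the recorded trace): the on-CPU time of an instrumented function is the sum of the overlaps between its entry/exit span and the intervals during which the scheduler had the thread running.

```cpp
// Sketch of the on-CPU accounting: intersect a function's entry/exit span with the
// thread's "running on a CPU" intervals (derived from sched_switch events) and sum
// the overlaps. Numbers in main() are made up for illustration.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Interval { double begin; double end; };  // seconds

double on_cpu_time(const Interval & span, const std::vector<Interval> & running)
{
  double total = 0.0;
  for (const auto & r : running) {
    const double lo = std::max(span.begin, r.begin);
    const double hi = std::min(span.end, r.end);
    if (hi > lo) {
      total += hi - lo;  // only count the portion where the thread was scheduled in
    }
  }
  return total;
}

int main()
{
  // Hypothetical example: a 1.0 s wait_for_work() span during which the thread was
  // only scheduled in for two short stretches.
  const Interval span{0.0, 1.0};
  const std::vector<Interval> running{{0.00, 0.15}, {0.90, 1.00}};
  std::printf("on-CPU: %.2f s of %.2f s\n",
    on_cpu_time(span, running), span.end - span.begin);
}
```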
Specifically, we only looked at the main `SingleThreadedExecutor` functions:
- `spin()`
- `get_next_executable()`
- `get_next_ready_executable()`
- `wait_for_work()`
- `execute_any_executable()`

See our executor instrumentation for `rclcpp`.
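For orientation, here is roughly how those functions relate in the executor's main loop. All types and bodies below are placeholder stand-ins of ours, not the rclcpp sources; the sketch only reflects the call structure we instrumented.

```cpp
// Simplified call structure of the instrumented functions: spin() repeatedly asks
// for the next ready executable and runs it; get_next_executable() first polls the
// ready lists and only blocks in wait_for_work() when nothing is ready yet.
#include <optional>

struct AnyExecutable {};                          // placeholder for rclcpp::AnyExecutable

std::optional<AnyExecutable> get_next_ready_executable() { return std::nullopt; }  // scan ready lists
void wait_for_work() {}                           // rebuild wait set, then block in rcl_wait()
void execute_any_executable(AnyExecutable &) {}   // dispatch the user callback
bool ok() { return false; }                       // stand-in for rclcpp::ok()

std::optional<AnyExecutable> get_next_executable()
{
  if (auto exec = get_next_ready_executable()) {
    return exec;                                  // something was already ready
  }
  wait_for_work();                                // otherwise wait for the middleware
  return get_next_ready_executable();             // re-scan after waking up
}

int main()
{
  // spin(): the loop traced at depth 0 in the results table.
  while (ok()) {
    if (auto exec = get_next_executable()) {
      execute_any_executable(*exec);
    }
  }
}
```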
## Results
As mentioned before, based on scheduling information, we only count CPU time when the thread is running on the CPU, not when it is blocked.
We chose the `ros` test case, since it has the highest CPU usage. We traced it for a few seconds. The thread itself has a CPU usage of 55.87% (this is less than the 70% overall CPU usage reported earlier, because it does not include time spent in the dedicated middleware threads).
In our first analysis, we looked at `wait_for_work` in some detail, because of the high overhead numbers reported earlier.
As you can see from the “function” bar, the core `rcl_wait` function indeed only takes ~32% CPU; the rest is Executor overhead. However, as you can also see from the “thread” bar, the whole method only makes up ~18% of the overall CPU usage of the thread. This means that other parts of the Executor are more important.
We therefore took a step back and looked at the two high-level functions in `spin`: `get_next_executable` and `execute_any_executable`.
The on-CPU time for each function is compared to the whole thread and to the parent function. In this case, 79.21% of the CPU time for the whole thread is spent in `get_next_executable` vs. 8.22% for `execute_any_executable`. These numbers are similar to what Nobleo has reported visually before.
Since `execute_any_executable` is likely dominated by running user code, we took a closer look at the functions inside `get_next_executable`: `get_next_ready_executable` and `wait_for_work`.
Here, `get_next_ready_executable` represents 67.02% of `get_next_executable`'s CPU time, and 53.09% of all the actual CPU time for the thread!
Looking at the code, `get_next_ready_executable` checks its lists of timers/subscriptions/services/clients/waitables and returns once it has found one that is ready to execute. As a side note, having to loop over all the lists would explain the large CPU usage difference between the `ros` test case and the `rosonenode` test case, since the latter has only one node.
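A simplified sketch of that pattern (placeholder types of ours, not the actual rclcpp code): the function scans the per-type lists in a fixed order and returns the first entity that is ready, so its cost grows with the number of registered entities.

```cpp
// "Check each list in turn" pattern: with many nodes/entities, these scans are what
// shows up as get_next_ready_executable's CPU time.
#include <optional>
#include <vector>

struct Timer {}; struct Subscription {}; struct Service {}; struct Client {}; struct Waitable {};

struct AnyExecutable
{
  const Timer * timer = nullptr;
  const Subscription * subscription = nullptr;
  // ... service, client, waitable
};

struct ReadyLists
{
  std::vector<Timer> timers;
  std::vector<Subscription> subscriptions;
  std::vector<Service> services;
  std::vector<Client> clients;
  std::vector<Waitable> waitables;
};

bool is_ready(const Timer &) { return false; }         // placeholder readiness checks
bool is_ready(const Subscription &) { return false; }

std::optional<AnyExecutable> get_next_ready_executable(const ReadyLists & lists)
{
  // Walk the lists in a fixed order and return the first entity that is ready.
  for (const auto & t : lists.timers) {
    if (is_ready(t)) { return AnyExecutable{&t, nullptr}; }
  }
  for (const auto & s : lists.subscriptions) {
    if (is_ready(s)) { return AnyExecutable{nullptr, &s}; }
  }
  // ... same for services, clients and waitables.
  return std::nullopt;
}

int main() { ReadyLists lists; (void)get_next_ready_executable(lists); }
```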
If we look at the CPU usage of each function individually, we can see that `get_next_ready_executable` is indeed the most CPU-intensive function.
The full data is below.
| depth | function or thread | overall duration (s) | actual duration (s) | CPU usage (actual / overall) (%) | actual duration wrt thread (%) | actual duration wrt parent function (%) |
|---|---|---|---|---|---|---|
| | thread | 2.93 | 1.64 | 55.87 | 100.00 | |
| 0 | `spin` | 2.62 | 1.47 | 56.22 | 89.97 | |
| 1 | `get_next_executable` | 2.44 | 1.30 | 53.16 | 79.21 | 88.04 |
| 2 | `get_next_ready_executable` | 0.87 | 0.87 | 99.46 | 53.09 | 67.02 |
| 2 | `wait_for_work` | 1.52 | 0.39 | 25.41 | 23.68 | 29.89 |
| 1 | `execute_any_executable` | 0.14 | 0.13 | 97.00 | 8.22 | 9.13 |
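To read the percentage columns, take the `get_next_ready_executable` row as a worked example: CPU usage is the actual duration divided by the overall duration (0.87 s / 0.87 s, i.e. close to 100%; the table's 99.46% comes from the unrounded durations), its share of the thread is 0.87 s / 1.64 s ≈ 53%, and its share of its parent is 0.87 s / 1.30 s ≈ 67% of `get_next_executable`'s actual time.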
In conclusion, the executor should be optimized. Figuring out if – and which – executable is ready seems to take a lot of CPU time.
We used LTTng and the `ros2_tracing` & `tracetools_analysis` packages. The Jupyter notebook that was used to get the results above can be found here. This post can also be found here; that version also shows how profiling overhead can really mess with the results.