SingleThreadedExecutor creates a high CPU overhead in ROS 2

Hello,

We are looking into the performance of ROS 2 on embedded boards, and we found that ROS 2 consumes a lot of CPU because of the overhead introduced by the SingleThreadedExecutor. We did some tests to profile the CPU usage and observed that when we run 20 publishers and 200 subscribers in one ROS node, 70% of the CPU is consumed by the SingleThreadedExecutor and 20% by the DDS implementation.

Running the same example against Fast RTPS directly consumes 3.5 times less CPU than ROS 2. The tests we performed, along with their results, can be found at this link: https://github.com/nobleo/ros2_performance.

Is anyone else also looking into measuring the CPU usage of ROS 2? Please share your findings here, and let us know if we are doing something wrong.

Our current analysis suggests that the SingleThreadedExecutor needs to be optimized; otherwise, normal ROS 2 cannot work properly on ARM A-class embedded boards. We are willing to look further into this problem and can help by performing more tests and providing feedback on improvements. Please let us know if there is any other way to contribute to this.

Thank you,
Ishu Goel

9 Likes

@christophebedard and I are also currently looking at this, but I’m going to give him some time to work more on it by replying in his stead :wink: He has looked into this based on his tracing work, see https://gitlab.com/micro-ROS/ros_tracing/ros2_tracing. It uses LTTng to instrument rclcpp and rcl directly.

First of all, many thanks for describing your results so openly and so early, particularly for providing the initial benchmark programs. This makes it much easier to compare and combine results.

In general, what I’ve heard from the OSRF and others is that people are somewhat aware of the inefficiencies in the executor, but nobody had exact numbers so far, and therefore this problem has not been prioritized. I think this has now changed :smile:

Regarding your analysis, one thing I would caution is that “perf record” is a sampling approach, which means it can miss executions that are too short. I don’t think this compromises your results, but since you were asking, I wanted to mention it.

Therefore, in our work, we use LTTng, which integrates both perf events and userspace tracepoints. We have tried both instrumenting every function automatically (which has noticeable overhead) and manually instrumenting just the most relevant functions. The latter is a bit more work, but it also gives more precise results.

About the single-threaded executor, one thing that I noticed is that it operates in the following way:

  • a node starts spinning
  • it receives a message (awakening from the spin)
  • the executor calls get_next_ready_executable to retrieve the entity that has to handle the message
  • the message is handled by the subscription
  • the executor checks all the registered entities again
  • if no entities have work to do, the executor goes back to sleep

Could the overhead of the executor be due to the fact that after every message it checks all the entities again? This can potentially be very expensive in cases like the one you tested (200 subscriptions in the same executor).
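To make that pattern concrete, here is a schematic sketch of the loop; the Entity type and spin_sketch function are hypothetical simplifications for illustration, not the actual rclcpp code:

```cpp
#include <functional>
#include <vector>

// Schematic sketch (hypothetical, simplified) of the spin pattern above.
struct Entity
{
  bool ready = false;
  std::function<void()> callback;
};

void spin_sketch(std::vector<Entity> & entities, const bool & shutdown)
{
  while (!shutdown) {
    // wait_for_work(): rebuild the wait set from ALL entities, then block
    // until at least one of them has work (omitted here).

    // get_next_ready_executable(): linear scan over all registered entities.
    for (auto & entity : entities) {
      if (entity.ready) {
        entity.ready = false;
        entity.callback();  // execute_any_executable()
        break;              // then the next iteration re-checks everything
      }
    }
  }
}
```

With 200 subscriptions in one executor, this full re-check runs for every single message.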

Moreover, the get_next_ready_executable() function in rclcpp/executor.cpp is marked with a TODO to “improve run to run efficiency of this function”.
@wjwwood do you already have any idea on how it should be improved?

2 Likes

@alsora Yes, that is a big part of the overhead.

Since the wait_set only really needs to be updated whenever there is a change to the entity list, it is likely that some of this effort could be avoided or made less expensive. However, without having had a more serious look at the design, I cannot currently say what the best option would be. Maybe @wjwwood or @dirk-thomas have some ideas.

I noticed it operates in the following way (although I could have missed something):

There is a list of nodes, a node has multiple callback groups, and a group has multiple executables (e.g. timer, subscription, client, service). So it is basically a tree: node -> group -> executable.

  1. It populates a list of all executables by walking the tree above into a memory strategy (promoting the weak_ptrs to shared_ptrs).
  2. The memory strategy is then converted into a wait set (in order to call into rcl).
  3. The wait set is waited upon. The implementation goes all the way down into the RMW layer and differs per RMW implementation.
  4. After the wait, only ready executables are left in the wait set (the others are null).
  5. The memory strategy is updated with this list (everything that is not ready is removed, allowing the weak_ptrs to expire).
  6. For each ready executable, the group is retrieved from the original tree by searching the entire tree.
  7. Execute.
  8. Go back to step 6 if more executables were ready; otherwise go to step 1.
  • The tree can be quite large, and it is walked often. There seems to be room for improvement by walking/copying/searching the list of executables less (mainly steps 5 & 6; see the sketch below).

  • It’s all weak_ptrs by design, but the executor must keep the memory valid (via shared_ptrs) as long as it’s in rcl_wait, which is a bit conflicting. It’s also the reason for lots of lookups in the original tree: the only way to see whether something has disappeared is to rebuild everything.

  • Is there some information on the design somewhere? Typically, an executor works by just submitting executables/callables/callbacks to a thread (pool), where the executor just maintains a queue of work to do. This is a more complicated design that is implemented in three different layers (rclcpp/rcl/rmw).

  • I also saw something on the roadmap about changing the relation between nodes/groups and refactoring the executor. Are there already more concrete ideas about this?
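To illustrate steps 1 and 5/6, here is a minimal sketch of that per-iteration rebuild; the types below are hypothetical stand-ins, not the real rclcpp classes:

```cpp
#include <memory>
#include <vector>

// Minimal sketch (hypothetical types) of the rebuild in steps 1 and 5/6.
struct Executable { /* timer, subscription, service, ... */ };
struct Group {std::vector<std::weak_ptr<Executable>> executables;};
struct Node {std::vector<std::weak_ptr<Group>> groups;};

std::vector<std::shared_ptr<Executable>>
collect_all(const std::vector<std::weak_ptr<Node>> & nodes)
{
  std::vector<std::shared_ptr<Executable>> result;
  for (const auto & weak_node : nodes) {
    auto node = weak_node.lock();  // promote weak -> shared
    if (!node) {continue;}         // the node went away: skip it
    for (const auto & weak_group : node->groups) {
      auto group = weak_group.lock();
      if (!group) {continue;}
      for (const auto & weak_exec : group->executables) {
        if (auto exec = weak_exec.lock()) {
          result.push_back(exec);  // keeps the memory valid during rcl_wait
        }
      }
    }
  }
  return result;  // rebuilt from scratch on every spin iteration
}
```

Every spin iteration pays for the full tree walk and all the weak-to-shared promotions, even when only a single subscription has work.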

3 Likes

@alsora

Could the overhead of the executor be due to the fact that after every message it checks all the entities again?

I had the same concern after a quick code scan, so I tried the following patch to see whether it affects CPU consumption:

```diff
diff --git a/rclcpp/src/rclcpp/executor.cpp b/rclcpp/src/rclcpp/executor.cpp
index 2a47b71..b45bc19 100644
--- a/rclcpp/src/rclcpp/executor.cpp
+++ b/rclcpp/src/rclcpp/executor.cpp
@@ -595,7 +595,7 @@ Executor::get_next_executable(AnyExecutable & any_executable, std::chrono::nanos
       return false;
     }
     // Try again
-    success = get_next_ready_executable(any_executable);
+    // success = get_next_ready_executable(any_executable);
   }
   // At this point any_exec should be valid with either a valid subscription
   // or a valid timer, or it should be a null shared_ptr
```
So far, it DOES NOT reduce CPU consumption.

My environment is:
Docker: 18.09.8, image ros:dashing
Host: Ubuntu 16.04.6 LTS / Intel(R) Core™ i7-4790 CPU @ 3.60GHz

Hi all,

What do you think about this?

I already checked whether it can reduce CPU consumption; it helps somewhat, but it is not a big deal…

I will dig deeper.

Hi @tomoyafujita,

Thanks a lot for your efforts. We are also working on creating a static scheduler to see how much performance gain can be achieved. We will share our results as soon as we complete our work. Please keep sharing the results of your work.

1 Like

@ishugoel

got it, thanks!
we will do the same!

tomoya

Sorry for the delay! Here are the results of the investigation done by @iluetkeb and me.

Summary

We could replicate the earlier results, showing that the Executor consumes a lot of CPU. In doing so, we could distinguish two cases:

  1. When there are few or no messages (e.g., for a timer-driven node), the wait_for_work method causes the majority of the overhead, with 70% of its time spent in rclcpp and only 30% (excluding waiting) spent in the RMW layer and below. We determined this using the “nopub”, pure-timer benchmark.
  2. When there are many messages, the majority of the CPU usage – up to about 30% of one CPU core in our tests – is caused by the get_next_ready_executable function. This is pure rclcpp. We determined this using scg’s “ros” benchmark, which sends 10000 small messages per second.

Background

Compared to earlier work with similar results, we took care to minimize overhead and to count only the time actually spent on the CPU. Therefore, we consider not just the qualitative result but also the absolute numbers to be trustworthy.

This was non-trivial, because the executor calls very many, very short functions (mainly to do with weak_ptrs). This causes problems both for traditional profiling (which adds lots of overhead) and for sampling-based profiling (which may miss these functions entirely). Just to give an idea: initialising the nodes took a good 10 seconds when using gcc -finstrument-functions! Without profiling, it takes ~100 ms.

To achieve this, we 1) explicitly instrument only the relevant high-level functions and 2) capture scheduling events. This allows us to sum CPU time only over the intervals when the thread is actually executing on the CPU.
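For reference, a manual LTTng-UST tracepoint looks roughly like the sketch below. The provider name executor_tp and event name wait_for_work_entry are made up for illustration; the actual instrumentation lives in the ros2_tracing repository linked above.

```cpp
// tp.h -- minimal LTTng-UST tracepoint definition. "executor_tp" and
// "wait_for_work_entry" are hypothetical names for illustration only.
#undef TRACEPOINT_PROVIDER
#define TRACEPOINT_PROVIDER executor_tp

#undef TRACEPOINT_INCLUDE
#define TRACEPOINT_INCLUDE "./tp.h"

#if !defined(_TP_H) || defined(TRACEPOINT_HEADER_MULTI_READ)
#define _TP_H

#include <lttng/tracepoint.h>
#include <stdint.h>

TRACEPOINT_EVENT(
  executor_tp,
  wait_for_work_entry,
  TP_ARGS(const void *, executor_arg),
  TP_FIELDS(ctf_integer_hex(uintptr_t, executor, (uintptr_t)executor_arg))
)

#endif  // _TP_H

#include <lttng/tracepoint-event.h>
```

Exactly one translation unit defines TRACEPOINT_DEFINE before including this header; the event is then emitted at the call site with tracepoint(executor_tp, wait_for_work_entry, this), and the binary is linked with -llttng-ust -ldl. When no tracing session is active, an unhit tracepoint costs little more than a predicted branch, which is why this adds so much less overhead than -finstrument-functions. Combined with the kernel’s sched_switch events, the analysis can then attribute CPU time to each function interval.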

Specifically, we only looked at the main SingleThreadedExecutor functions:

  • spin()
    • get_next_executable()
      • get_next_ready_executable()
      • wait_for_work()
    • execute_any_executable()

See our executor instrumentation for rclcpp.

Results

As mentioned before, based on scheduling information, we only count CPU time when the thread is running on the CPU, not when it is blocked.

We chose the “ros” test case, since it has the highest CPU usage, and traced it for a few seconds. The thread itself has a CPU usage of 55.87% (this is less than the 70% overall CPU usage reported earlier because it does not include time spent in the dedicated middleware threads).

In our first analysis, we looked at wait_for_work in some detail, because of the high overhead numbers reported earlier.

As you can see from the “function” bar, the core rcl_wait function indeed only takes ~32% CPU; the rest is Executor overhead. However, as you can also see from the “thread” bar, the whole method only makes up ~18% of the CPU usage of the overall thread. This means that other parts of the Executor are more important.

Therefore, we took a step back and looked at the two high-level functions in spin: get_next_executable and execute_any_executable.

The on-CPU time for each function is compared to the whole thread and to the parent function. In this case, 79.21% of the CPU time for the whole thread is spent in get_next_executable vs. 8.22% in execute_any_executable. These numbers are similar to what Nobleo visually reported earlier.

Since execute_any_executable is likely dominated by running user code, we took a closer look at the functions inside get_next_executable: get_next_ready_executable and wait_for_work.

Here, get_next_ready_executable represents 67.02% of get_next_executable's CPU time, and 53.09% of all the actual CPU time for the thread!

Looking at the code, get_next_ready_executable checks its lists of timers/subscriptions/services/clients/waitables and returns once it has found one that is ready to execute. As a side note, having to loop over all the lists would explain the large CPU usage difference between the ros test case and the rosonenode test case, since the latter has only one node.
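Its rough shape is something like the following sketch (hypothetical simplified types, not the actual implementation):

```cpp
#include <deque>
#include <functional>

// Simplified sketch (hypothetical types) of the shape of
// get_next_ready_executable: each call walks the per-type lists in order and
// returns the first ready entity, so its cost grows with the total number of
// entities registered with the executor.
struct Item
{
  bool ready = false;
  std::function<void()> handle;
};

using List = std::deque<Item *>;

static bool take_first_ready(List & list, Item *& out)
{
  for (Item * item : list) {
    if (item->ready) {
      out = item;
      return true;
    }
  }
  return false;
}

bool get_next_ready_executable_sketch(
  List & timers, List & subscriptions, List & services,
  List & clients, List & waitables, Item *& out)
{
  return take_first_ready(timers, out) ||
         take_first_ready(subscriptions, out) ||
         take_first_ready(services, out) ||
         take_first_ready(clients, out) ||
         take_first_ready(waitables, out);
}
```

The scan itself is cheap per element, but it runs once per handled message, over every registered entity.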

If we look at the CPU usage for each function individually, we can see that get_next_ready_executable is indeed the most CPU-intensive function.

The full data is below.

| depth | function or thread | overall duration (s) | actual duration (s) | CPU usage (actual / overall) (%) | actual duration, wrt thread (%) | actual duration, wrt parent function (%) |
|---|---|---|---|---|---|---|
| – | thread | 2.93 | 1.64 | 55.87 | 100.00 | – |
| 0 | spin | 2.62 | 1.47 | 56.22 | 89.97 | – |
| 1 | get_next_executable | 2.44 | 1.30 | 53.16 | 79.21 | 88.04 |
| 2 | get_next_ready_executable | 0.87 | 0.87 | 99.46 | 53.09 | 67.02 |
| 2 | wait_for_work | 1.52 | 0.39 | 25.41 | 23.68 | 29.89 |
| 1 | execute_any_executable | 0.14 | 0.13 | 97.00 | 8.22 | 9.13 |

In conclusion, the executor should be optimized. Figuring out if – and which – executable is ready seems to take a lot of CPU time.

We used LTTng and the ros2_tracing & tracetools_analysis packages. The Jupyter notebook that was used to get the results above can be found here. A version of this post, which also shows how profiling overhead can really mess with the results, can be found here.

4 Likes

Thank you Christophe, very nice results. Good to see that we came to the same conclusions; this makes our case even stronger. I’m currently working on posting an issue on the rclcpp GitHub, where I will reference this discussion. I think your findings will be very helpful!

Edit: The issue is now available here: https://github.com/ros2/rclcpp/issues/825

1 Like

By the way, for reference with respect to the changes @tomoyafujita made: no single call is to blame. The main issue is that, for every single timer invocation or message, the whole internal representation is traversed. In contrast, if you use the middleware directly, you can attach a listener directly to each communication object. This avoids traversal completely.
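For comparison, the listener style looks conceptually like this; the sketch below is generic and hypothetical, not a specific DDS API:

```cpp
#include <functional>
#include <string>
#include <utility>

// Generic sketch of the listener style available when using the middleware
// directly (not a specific DDS API): each reader owns its own callback, so an
// arriving message is dispatched immediately, without traversing any of the
// other readers.
class Reader
{
public:
  explicit Reader(std::function<void(const std::string &)> on_data)
  : on_data_(std::move(on_data)) {}

  // Called by the middleware's receive thread for THIS reader only.
  void on_data_available(const std::string & msg) {on_data_(msg);}

private:
  std::function<void(const std::string &)> on_data_;
};
```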

However, the listener approach has the problem that we have very little control over when which message is executed. That’s precisely why ROS 2 adds executors, and why it can even have different ones.

IMHO, it would help to look at the interface between rmw and the executor, to pass more information across and thus avoid traversal.

1 Like

I need more time to dig deeper, but I do agree with this.

Besides, since this is an optimization effort, we might as well define a reasonable goal to achieve.

tomoya

Just FYI,
We created “execute_any_executable_list” and “get_next_ready_executable_list” to reap as many ready executables as possible in a single iteration. (That is, if multiple executables are ready to fire, the number of iterations needed to reap them all is much smaller.)
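Roughly, the idea looks like this sketch, with hypothetical helpers mirroring the names above (not the actual patch):

```cpp
#include <functional>
#include <utility>
#include <vector>

// Sketch of the batching idea: collect ALL ready executables in one pass,
// then execute the whole batch, instead of re-scanning after every callback.
using Executable = std::function<void()>;

std::vector<Executable> get_next_ready_executable_list_sketch(
  std::vector<std::pair<bool, Executable>> & entities)
{
  std::vector<Executable> batch;
  for (auto & [ready, exec] : entities) {
    if (ready) {
      batch.push_back(exec);
      ready = false;
    }
  }
  return batch;
}

void execute_any_executable_list_sketch(const std::vector<Executable> & batch)
{
  for (const auto & exec : batch) {
    exec();  // one wait/scan cycle now serves many callbacks
  }
}
```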

So far, we do not see much improvement.

@ivanpauno, and all

Could you take a look at the following PR?

thanks,
Tomoya

Hello everyone,

Our first POC of a static version of the Executor can be found here: https://github.com/nobleo/rclcpp-static-executor. This version works with the latest stable release of Dashing, giving the following results:


As you can see, the StaticExecutor decreases CPU usage significantly.

Our StaticExecutor has been added to rclcpp in such a way that the old functionality remains intact. To use our executor, please follow the README. The package also contains Dockerfiles to quickly inspect the CPU usage on your PC for different executors (the LET executor created by Bosch for micro-ROS is also included in this comparison).
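Assuming the StaticExecutor keeps the standard Executor interface, switching to it should look roughly like the sketch below; the class name rclcpp::executors::StaticExecutor is an assumption here, so follow the README for the authoritative usage:

```cpp
#include <memory>

#include "rclcpp/rclcpp.hpp"

// Hypothetical usage sketch: the class name rclcpp::executors::StaticExecutor
// is an assumption; see the README for the actual instructions.
int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("my_node");

  // Drop-in replacement for rclcpp::executors::SingleThreadedExecutor:
  rclcpp::executors::StaticExecutor executor;
  executor.add_node(node);
  executor.spin();  // entities are assumed fixed once spinning starts

  rclcpp::shutdown();
  return 0;
}
```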

If you try out the Docker example, please share your results. It would be even better if you could use our executor with your own source code, so it can be tested against more use cases. If you run into bugs, please let us know! We did make some assumptions with respect to the source code, given that this is a POC (the assumptions are mentioned in the README).

We think this POC is a good first step to highlight possible performance gains. The final goal is to get an optimized Executor with proper scheduling mechanics into the core ROS 2 stack. We are currently working on a fork of ros2 master to create a proper PR for this version. We will keep you updated on the PR progress here.

3 Likes

Rather than a fork, you could probably provide your new executor as a separate library.

@MartinCornelis

Great!!! We will look into that.

Rather than a fork, you could probably provide your new executor as a separate library.

+1 on this.

thanks

Hello everyone,

For now, we created this PR for rclcpp: https://github.com/ros2/rclcpp/pull/873. We are considering making the code a separate library. Having the static executor as an optional package would prevent bloating ROS 2. However, the package would also require a maintainer. Since we are a relatively small team that plans on doing more work (creating more packages in the future), we have to consider whether, and which, packages we want to maintain. The static executor is a relatively small package, so we could consider picking it up (this is an internal discussion we have yet to have).

Please leave your comments and thoughts on the code under the PR. Even if the PR does not get approved, we hope to at least draw attention to the CPU overhead of the current implementation.

3 Likes

Small update: we updated the Dashing version of our static executor to be semi-dynamic. The nodes’ guard conditions are used as an event trigger to rebuild the wait set and executable list. This means that when a subscriber, timer, etc. is added during spin(), the executor will notice (by checking the guard condition) and rebuild, making the static executor less restrictive to use.
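Conceptually, the spin loop then looks something like this sketch; the names are hypothetical, based only on the description above, not the actual code:

```cpp
#include <atomic>
#include <functional>

// Sketch (hypothetical names) of the semi-dynamic loop: a rebuild only
// happens when a node's guard condition reports that an entity was added
// or removed during spin().
struct StaticSpinSketch
{
  std::function<void()> rebuild_wait_set_and_executables;
  std::function<bool()> wait_and_check_guard_conditions;  // true if changed
  std::function<void()> execute_ready_executables;

  void spin(const std::atomic_bool & shutdown)
  {
    rebuild_wait_set_and_executables();  // once, up front
    while (!shutdown) {
      if (wait_and_check_guard_conditions()) {
        // a subscriber, timer, etc. was added or removed during spin():
        rebuild_wait_set_and_executables();
      }
      execute_ready_executables();
    }
  }
};
```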

This updated version can (still) be found here rclcpp Dashing.

We will create a master (Eloquent) version of this, but we first want to fix some Jenkins linter errors and do some cleanup on our PR.

If you try out our code, please share your results here, and please report any bugs you find. Possible optimizations are best posted on the PR once we apply the changes there.

2 Likes