
SingleThreadedExecutor creates a high CPU overhead in ROS 2

Hi all,

what do you think about this?

I already checked whether I can reduce the CPU consumption; it helps some, but not a big deal…

will dig deeper.

Hi @tomoyafujita,

Thanks a lot for your efforts. We are also working on creating a static scheduler to see how much performance gain can be achieved. We will share our result as soon as we complete our work. Please keep sharing the results of your work.


@ishugoel

Got it, thanks!
We will do the same!

tomoya

Sorry for the delay! Here are the results of the investigation done by @iluetkeb and me.

Summary

We could replicate the earlier results, showing that the Executor consumes a lot of CPU. Within that, we can distinguish two cases:

  1. when there are few or no messages (e.g., for a timer-driven node), then the wait_for_work method causes the majority of the overhead, with 70% of its time spent in rclcpp and only 30% (excluding waiting) spent in the RMW layer and below. We determined this using the “nopub”, pure timer benchmark.
  2. when there are many messages, the majority of the CPU usage – up to about 30% of one CPU core in our tests – is caused by the get_next_ready_executable function. This is pure rclcpp. We determined this using scg’s “ros” benchmark, which sends 10000 small messages per second.

Background

Compared to earlier work with similar results, we took care to minimize overhead and to count only the time actually spent on the CPU. Therefore, we consider not just the qualitative result, but also the absolute numbers to be trustworthy.

This has been non-trivial, because the executor calls very many, very short functions (mainly to do with weak_ptrs). This causes problems both for traditional profiling (which adds lots of overhead) and for sampling-based profiling (which may not notice these). Just to give an idea, initialising the nodes took at least a good 10 seconds when using gcc -finstrument-functions! Without profiling, it takes ~100 ms.

To achieve this, we 1) explicitly instrument only the relevant high-level functions and 2) capture scheduling events. This allows us to sum CPU time only when the thread is actually executing on the CPU.

Specifically, we only looked at the main SingleThreadedExecutor functions:

  • spin()
    • get_next_executable()
      • get_next_ready_executable()
      • wait_for_work()
    • execute_any_executable()

See our executor instrumentation for rclcpp.
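
To give an idea of what the explicit instrumentation looks like, here is a minimal standalone LTTng-UST sketch; the real instrumentation lives in the ros2_tracing/tracetools packages, and the provider and event names below are invented for this example.

```cpp
// executor_tp.h -- minimal LTTng-UST tracepoint provider (illustrative only).
// A .cpp in the same package must #define TRACEPOINT_CREATE_PROBES and
// TRACEPOINT_DEFINE before including this header, and link against lttng-ust.
#undef TRACEPOINT_PROVIDER
#define TRACEPOINT_PROVIDER executor_demo

#undef TRACEPOINT_INCLUDE
#define TRACEPOINT_INCLUDE "./executor_tp.h"

#if !defined(EXECUTOR_TP_H_) || defined(TRACEPOINT_HEADER_MULTI_READ)
#define EXECUTOR_TP_H_

#include <lttng/tracepoint.h>

// One entry/exit event pair per instrumented high-level function, e.g.:
TRACEPOINT_EVENT(executor_demo, wait_for_work_entry, TP_ARGS(), TP_FIELDS())
TRACEPOINT_EVENT(executor_demo, wait_for_work_exit, TP_ARGS(), TP_FIELDS())

#endif  // EXECUTOR_TP_H_

#include <lttng/tracepoint-event.h>
```

Inside the instrumented function, `tracepoint(executor_demo, wait_for_work_entry);` is emitted on entry and the matching exit event on return. Recording the kernel's `sched_switch` events in the same tracing session (`lttng enable-event --kernel sched_switch`) is what lets the analysis intersect each entry/exit interval with the intervals where the thread was actually on the CPU.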

Results

As mentioned before, based on scheduling information, we only count CPU time when the thread is running on the CPU, not when it is blocked.

We chose the ros test case, since it has the highest CPU usage. We traced it for a few seconds. The thread itself has a CPU usage of 55.87% (this is less than the 70% overall CPU usage reported earlier, because it does not include time spent in the dedicated middleware threads).

In our first analysis, we looked at wait_for_work in some detail, because of the high overhead numbers reported earlier.

As you can see from the “function” bar, the core rcl_wait function indeed only takes ~32% CPU; the rest is Executor overhead. However, as you can also see from the “thread” bar, the whole method only makes up ~18% of the CPU usage of the overall thread. This means that other parts of the Executor are more important.

Therefore, we took a step back and looked at the two high-level functions in spin: get_next_executable and execute_any_executable.

The ON CPU time for each function is compared to the whole thread and to the parent function. In this case, 79.21% of the CPU time for the whole thread is spent in get_next_executable vs. 8.22% for execute_any_executable. These numbers are similar to what has been visually reported before by Nobleo.

Since execute_any_executable is likely dominated by running user code, we took a closer look at the functions inside get_next_executable: get_next_ready_executable and wait_for_work.

Here, get_next_ready_executable represents 67.02% of get_next_executable's CPU time, and 53.09% of all the actual CPU time for the thread!

Looking at the code, get_next_ready_executable checks its lists of timers/subscriptions/services/clients/waitables and returns once it has found one that is ready to execute. As a side note, having to loop over all the lists would explain the large CPU usage difference between the ros test case and the rosonenode test case, since the latter has only one node.
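
To make that concrete, here is a simplified sketch of the traversal pattern; this is not the actual rclcpp code (which goes through a memory strategy and also handles services, clients, and waitables), just an illustration of why the cost grows with the number of entities.

```cpp
#include <memory>
#include <vector>

// Stand-ins for rclcpp entities (illustrative only).
struct Timer { bool is_ready = false; };
struct Subscription { bool has_message = false; };

struct AnyExecutable
{
  std::shared_ptr<Timer> timer;
  std::shared_ptr<Subscription> subscription;
};

// Each call walks the *entire* lists of entities until one ready item is
// found, locking a weak_ptr per entity -- so the per-message cost grows with
// the total number of timers/subscriptions/... known to the executor.
bool get_next_ready_executable(
  const std::vector<std::weak_ptr<Timer>> & timers,
  const std::vector<std::weak_ptr<Subscription>> & subscriptions,
  AnyExecutable & any_executable)
{
  for (const auto & weak_timer : timers) {
    if (auto timer = weak_timer.lock()) {  // many short-lived locks
      if (timer->is_ready) {
        any_executable.timer = timer;
        return true;
      }
    }
  }
  for (const auto & weak_sub : subscriptions) {
    if (auto sub = weak_sub.lock()) {
      if (sub->has_message) {
        any_executable.subscription = sub;
        return true;
      }
    }
  }
  // ... the real code repeats this for services, clients, and waitables ...
  return false;
}
```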

If we look at the CPU usage for each function individually, we can see that get_next_ready_executable is indeed the most CPU-intensive function.

The full data is below.

| depth | function or thread | overall duration (s) | actual duration (s) | CPU usage (actual / overall) (%) | actual duration wrt thread (%) | actual duration wrt parent function (%) |
| --- | --- | --- | --- | --- | --- | --- |
| – | thread | 2.93 | 1.64 | 55.87 | 100.00 | – |
| 0 | spin | 2.62 | 1.47 | 56.22 | 89.97 | – |
| 1 | get_next_executable | 2.44 | 1.30 | 53.16 | 79.21 | 88.04 |
| 2 | get_next_ready_executable | 0.87 | 0.87 | 99.46 | 53.09 | 67.02 |
| 2 | wait_for_work | 1.52 | 0.39 | 25.41 | 23.68 | 29.89 |
| 1 | execute_any_executable | 0.14 | 0.13 | 97.00 | 8.22 | 9.13 |

In conclusion, the executor should be optimized. Figuring out if – and which – executable is ready seems to take a lot of CPU time.

We used LTTng and the ros2_tracing & tracetools_analysis packages. The Jupyter notebook that was used to get the results above can be found here. This post can also be found here, along with a demonstration of how profiling overhead can really skew the results.


Thank you, Christophe, very nice results. Good to see that we came to the same conclusions; this makes our case even stronger. I’m currently working on posting an issue on the rclcpp GitHub where I will reference this discussion. I think your findings will be very helpful!

Edit: The issue is now available here: https://github.com/ros2/rclcpp/issues/825


By the way, for reference with respect to the changes @tomoyafujita made: no single call is to blame. The main issue is that, for every single timer invocation or message, the whole internal representation is traversed. In contrast, if you use the middleware directly, you can attach a listener directly to each communication object. This avoids traversal completely.
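
For illustration, this is roughly what the listener style looks like when using Fast DDS (formerly Fast RTPS) directly; the class and callback names below come from the Fast DDS API, but take the snippet as a rough sketch, as details vary by version.

```cpp
#include <fastdds/dds/subscriber/DataReader.hpp>
#include <fastdds/dds/subscriber/DataReaderListener.hpp>

class MyListener : public eprosima::fastdds::dds::DataReaderListener
{
public:
  // Invoked by the middleware as soon as data arrives on *this* reader --
  // no traversal over all known entities is needed.
  void on_data_available(eprosima::fastdds::dds::DataReader * /*reader*/) override
  {
    // take() and process the sample here; note this runs on a middleware
    // thread, outside the application's control.
  }
};

// The listener is attached per reader when creating it, e.g.:
//   subscriber->create_datareader(topic, reader_qos, &listener);
```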

However, the listener approach has the problem that we have very little control over when each message is executed. That’s precisely why ROS 2 adds executors, and why it can even have different ones.

IMHO, it would help to look at the interface between rmw and the executor, to pass more information across and thus avoid traversal.


I need more time to dig deeper, but I do agree on this.

Besides, since this is an optimization effort, we might as well define a reasonable goal to achieve.

tomoya

Just FYI,
we created “execute_any_executable_list” and “get_next_ready_executable_list” to reap as many ready executables as possible in a single iteration (that is, if multiple executables are ready to fire, the number of iterations needed to reap them is much lower).
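
Roughly, the idea is the following (a hypothetical sketch: the two function names come from this post, while the types and bodies are assumptions, not the actual patch).

```cpp
#include <vector>

// Stand-in for rclcpp's AnyExecutable (illustrative only).
struct AnyExecutable {};

// Hypothetical: collect *all* currently ready executables in one traversal
// of the timers/subscriptions/... lists, instead of stopping at the first.
std::vector<AnyExecutable> get_next_ready_executable_list();

// Hypothetical: execute the whole batch back to back, so a full traversal is
// paid once per batch rather than once per executable.
void execute_any_executable_list(std::vector<AnyExecutable> & ready_list)
{
  for (auto & any_executable : ready_list) {
    // execute_any_executable(any_executable);  // dispatch as usual
  }
}
```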

So far, we do not see much improvement.

@ivanpauno, and all

Could you take a look at the following PR?

thanks,
Tomoya

Hello everyone,

Our first POC for a static version of the Executor can be found here: https://github.com/nobleo/rclcpp-static-executor . This version works with the latest stable release of Dashing, giving the following results:


As you can see, the StaticExecutor decreases CPU usage significantly.

Our StaticExecutor has been added to rclcpp in such a way that the old functionality remains intact. To use our executor, please follow the README. The package also contains Dockerfiles to quickly inspect the CPU usage on your PC for different executors (the LET executor created by Bosch for micro-ROS is also included in this comparison).

If you try out the Docker example, please share your results. It would be even better if you could use our executor with your own source code; this way it can be tested against more use cases. If you run into bugs, please let us know! We did make some assumptions with respect to the source code, given that this is a POC (the assumptions are mentioned in the README).

We think this POC is a good first step to highlight possible performance gains. The final goal is to get an optimized Executor with proper scheduling mechanics into the core ROS 2 stack. We are currently working on a fork of ros2 master to create a proper PR for this version. We will keep you updated on the PR progress here.


Rather than a fork, you could probably provide your new executor as a separate library.

@MartinCornelis

Great!!! We will look into that.

Rather than a fork, you could probably provide your new executor as a separate library.

+1 on this.

thanks

Hello everyone,

For now we created this PR for rclcpp: https://github.com/ros2/rclcpp/pull/873 . We are considering making the code a separate library. Having the static executor as an optional package would prevent bloating ROS 2. However, the package would also require a maintainer. Since we are a relatively small team that plans on doing more work (creating more packages in the future), we have to consider whether and which packages we want to maintain. The static executor is a relatively small package, so we could consider picking it up (this is an internal discussion we have yet to have).

Please leave your comments and thoughts on the code under the PR. Even if the PR does not get approved, we hope to at least draw attention to the CPU overhead of the current implementation.


Small update: We updated the Dashing version of our static executor to be semi-dynamic. The node guard conditions are used as an event trigger to rebuild the wait set and executable list. This means that when a subscriber, timer, etc. is added during spin(), the executor notices (by checking the guard condition) and rebuilds, making the static executor less restrictive to use.
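
In rough pseudocode, the mechanism looks like this (all names are hypothetical; this is a sketch of the idea, not the actual StaticSingleThreadedExecutor code).

```cpp
#include <atomic>

// Flag set from the node's guard condition whenever an entity is added or
// removed (all names here are hypothetical).
std::atomic<bool> entities_changed{true};

void spin()
{
  while (true /* rclcpp::ok() */) {
    if (entities_changed.exchange(false)) {
      // Rebuild the wait set and the executable list only when the guard
      // condition fired, e.g. a subscription was created during spin().
      // rebuild_wait_set_and_executable_list();
    }
    // wait_for_work();  // wakes on data, timers, or guard conditions
    // ...execute everything that is ready, straight from the prebuilt list...
  }
}
```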

This updated version can (still) be found here: rclcpp Dashing.

We will create a master (Eloquent) version of this, but we first want to fix some Jenkins linter errors and do some cleanup on our PR.

If you try out our code please share your results here. Please report any bugs you find. Possible optimizations are best posted on the PR when we apply the changes there.


Hey guys, sorry for not responding here earlier, I have been following the discussions with a lot of interest. I really appreciate all the work you guys did to identify the issues with the current implementation.

I’m also planning on doing a first review of the proposed static executor that @MartinCornelis posted as soon as I can. We’ve had a bit of a backlog of features going into rclcpp, but hopefully we can make a lot of progress on executor-related changes during the F-turtle sprint.

I just wanted to briefly mention a few other changes we’ve been wanting to do with respect to the executor, which have sort of been preventing me from nailing down design documentation and recommending courses of action on threads like this one.


First, I really want to change the executor design so that you can use more than one executor per node. At the moment the association is “an executor may have zero to many nodes, and a node may be associated with zero to one executors”. In the future, I’d like callback groups to be the most granular thing that can be associated with an executor. I believe this was one of the possible ways to improve the design mentioned in https://github.com/ros2/rclcpp/issues/825.
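
A hypothetical usage sketch of what that association could look like (this API did not exist at the time; in particular, add_callback_group is an assumption, not rclcpp’s actual interface):

```cpp
#include <memory>
#include <rclcpp/rclcpp.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("my_node");

  // Two callback groups on the same node...
  auto realtime_group = node->create_callback_group(
    rclcpp::callback_group::CallbackGroupType::MutuallyExclusive);
  auto background_group = node->create_callback_group(
    rclcpp::callback_group::CallbackGroupType::MutuallyExclusive);

  // ...each handled by its own executor (hypothetical interface).
  rclcpp::executors::SingleThreadedExecutor realtime_executor;
  rclcpp::executors::SingleThreadedExecutor background_executor;
  realtime_executor.add_callback_group(realtime_group, node);      // assumption
  background_executor.add_callback_group(background_group, node);  // assumption

  rclcpp::shutdown();
  return 0;
}
```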

The other major change is that we’d like to create a “wait set”-like class in rclcpp (we already have the wait set in rcl), so that users may choose to avoid the executor pattern altogether and instead wait on items and decide how to handle them on their own. In this case, I think that callback groups and executors will not be used. I’m still thinking about all the implications and possible use cases (including mixed use of executors and wait sets). This isn’t directly affecting the discussions here, but it may have an impact, as “waitables” like timers and subscriptions may no longer have to be associated with a callback group or executor, whereas right now they must be in order to be used.
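
A rough sketch of that wait-set usage pattern (the rclcpp-level class was still being designed at this point, so the names below are assumptions modeled on the rcl wait set):

```cpp
#include <chrono>
#include <memory>
#include <rclcpp/rclcpp.hpp>

void wait_and_handle(std::shared_ptr<rclcpp::SubscriptionBase> subscription)
{
  // Hypothetical rclcpp-level wait set: the user waits and dispatches by
  // hand instead of handing the subscription to an executor.
  rclcpp::WaitSet wait_set;                                     // assumption
  wait_set.add_subscription(subscription);                      // assumption
  auto result = wait_set.wait(std::chrono::milliseconds(100));  // assumption
  if (result.kind() == rclcpp::WaitResultKind::Ready) {         // assumption
    // Decide here which ready item to take and when to execute it --
    // no callback group or executor involved.
  }
}
```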

Finally, there’s a lot of interface cleanup around the executor that I’d like to undertake, specifically to expose the scheduling logic (currently it’s very naive and hard-coded). I’d also like to refactor the “memory strategy” class. It has a very important purpose (allowing you to control any incidental memory allocations), but its current design is pretty hard to understand.


I haven’t decided whether we should try to integrate the suggested changes and/or tackle the performance problems described here first and then make some of the changes I described above, or first make the architectural changes and then re-evaluate the feedback in this thread, or try to do them together somehow. Perhaps a compromise would be to do the architectural changes while also working with people in this thread to ensure proper tracing hooks and to catch obvious performance issues as we go, and then look at more changes we could make, e.g. a more static executor design and/or changes to rmw to provide more information from the middleware.

I hope this is something we can discuss in detail at ROSCon (for those who will be there) and at the real-time working group as well. We’ll do our best to summarize the discussions here too.


Hey @wjwwood,

Thank you very much for the kind words. We decided to pause our work on the PR for now specifically because of the points mentioned in your post.

The way we implemented the static executor at the moment gets the job done, but it would be even better if we could write an executor that captures multiple improvements at the same time. We are looking forward to the rclcpp changes planned for the Foxy release.

In the meantime, we’ve separated the static_executor functionality from rclcpp and written it as a separate library, as requested by some users.
The Dashing and Eloquent versions can be found here:


To use the static_executor, please look at the README. By default the original executor will be used; you have to make changes to the package.xml, the CMakeLists.txt, and your source code to actually use the static_executor.

We hope this separate library version can help some people out, while we all wait for the even more awesome executor that is planned for Foxy! 🦊🤖


Hi,
with the next ROS 2 release approaching quickly, I would like to revive this discussion.

At iRobot we are currently investigating the performance of ROS 2 on single-core platforms, so improving the executor and its related data structures is crucial.

My colleague Mauro Passerino is currently working on improvements to the StaticExecutor proposed by @MartinCornelis, with the goal of having it merged in the Foxy release.

We ran multiple tests using our benchmark application and got the following results for a 10-node system:

  • SingleThreadedExecutor: CPU usage 72%
  • StaticSingleThreadedExecutor: CPU usage 53%
  • StaticSingleThreadedExecutor + our changes: CPU usage 40%

You can find more details in the static executor PR.

These are already great improvements; however, I think it would be very productive to have a discussion about what other steps can be taken, both for the next release and for the future of ROS.

@wjwwood @ivanpauno @tomoyafujita @Ingo_Lutkebohle and any one else in this thread, would you be interested in scheduling a meeting on these topics?


I’m currently trying to finish a pull request to kick off the changes to the executor design, and while doing it, I think I have decided to take the static executor PR first. I’m not 100% sure yet, but I’m leaning that way. Either way, I intend to make some progress on that PR this week.

As for having a meeting about what to do in the future, that’s fine, but in the next two weeks I will be very busy trying to get the already-planned features into the Foxy release, with help from some others like @Ingo_Lutkebohle. So I don’t think there’s much time to add more items for this release, nor will having a lot of extra meetings help get them in (at least for me personally). I’d prefer to schedule this for a few weeks from now, but I’m happy to attend and contribute to it.


@alsora I can invite you to the RTWG where the Executor topic is being regularly discussed: https://docs.google.com/document/d/1zBKwDUDeWvJNyCvjzYriaZQoZO2VYGWe1uxw5Xxn5cY/edit?usp=sharing

The meeting coordinates you can find in this calendar: https://index.ros.org/doc/ros2/Governance/#upcoming-ros-events.


Thank you! I will try to join the next meeting!

I saw the PR from @wjwwood for the executor refactor: https://github.com/ros2/rclcpp/pull/1047
We will probably make some comments there in the meantime.