The ROS 2 C++ Executors

Hi,

I'm starting this topic to spread some awareness about the current status of ROS 2 executors in the rclcpp C++ client library and to discuss their future.
Let’s have a quick look at the available executor classes.

First we have probably the two most famous executors:

  • The SingleThreadedExecutor: this is the default executor for ROS 2 (and the first developed). It uses wait-sets and processes events in an arbitrary order.
  • The MultiThreadedExecutor: this is a multi-threaded version of the SingleThreadedExecutor.

The initial implementation of these executors had major performance issues.
So the ROS community spent a good amount of time trying to improve them.
This resulted in a lot of different executors, both open and closed source.
I suggest looking at the ROSCon 2021 “Executors Workshop”.

Out of all this work, two new executors got into rclcpp:

  • The StaticSingleThreadedExecutor: this executor was developed by Nobleo and then improved by iRobot. Its initial implementation was focused on the idea that ROS 2 systems are static at steady-state, so you don’t need to rebuild the list of entities at every iteration. Despite the name, this executor is fully usable also for non-static systems.
  • The EventsExecutor: this executor was developed by iRobot. Its main difference with respect to all the other executors is that it doesn’t use the concept of wait-set, but rather it is based on an events-queue. This results in not paying overhead for entities that are not receiving events and in the possibility of processing events in the correct arrival order.
    It uses the same concept as the static executor to avoid rebuilding the list of entities.

As part of ROS 2 Jazzy, the rclcpp maintainers did a major rework of the executors, essentially generalizing to all executors the approach of not rebuilding the list of entities at every iteration (previously used only by the StaticSingleThreadedExecutor and the EventsExecutor).

Let’s have a look at the performance of the executors today, using ROS 2 Rolling and the irobot performance framework.
I created a pseudo-random system with ~40 publishers and ~70 subscriptions spread across 8 executors, let them publish for 60 seconds, and got these numbers (which align with my expectations and previous results):

The conclusions I would like to draw out of this are:

  • The SingleThreadedExecutor is now more performant than the StaticSingleThreadedExecutor.
  • The StaticSingleThreadedExecutor should now be deprecated: all its improvements are now available in all the other executors; moreover, there are open issues and concerns that are specific to its implementation.
  • The EventsExecutor should already be the executor of choice if you are looking for improved performance, and we should start the process of promoting it to default executor.
37 Likes

Thanks for sharing this.

Shall we consider changing the default executor in the tutorial to the EventsExecutor?

1 Like

Are there any notable downsides to the EventsExecutor's current implementation that we should be aware of, or areas where it is not feature complete?

I’d love to make a new isolated events-executor container for rclcpp_components so that this could be a runtime option for Nav2 users (or just move over to it if there are no notable downsides / issues).

1 Like

TL;DR:
Before making the EventsExecutor the default one, we need to fix a bug and provide an alternative, backward-compatible mode.


The long version:

There’s currently an important bug that makes the events-executor not usable in combination with simulated time: TimersManager doesn't follow ROS time · Issue #2480 · ros2/rclcpp · GitHub
The client library working group is currently discussing solutions.

Besides the aforementioned bug, there are some concerns because the current implementation of the events-executor orders entity execution differently from the other executors.
This matters in situations when an executor thread wakes up and more than 1 entity is ready for work.

The EventsExecutor always executes entities in the order in which they become ready, while wait-set based executors have a fixed order: they always execute timers first, then subscriptions, etc.; moreover, there’s an implicit ordering within each group of entities, based on the order in which entities were added to the executor or node.

Note that this contributes to the difference in performance: wait-set based executors iterate over the list of entities and check which ones are ready. If you have 50 entities in an executor but only 1 is receiving messages, you waste time looping over all of them every time (that’s why disabling built-in services/topics improves CPU usage with wait-set executors).
With the events-executor, on the other hand, you have a list of ready events, each with an associated entity, so every time the executor has work to do, rather than iterating over multiple lists it performs look-ups in a map.

This difference also affects the situation where an entity has more than one item of work to do, such as a subscription that received 2 messages.
The wait-set based executors can’t process 2 items of work from the same entity within the same iteration.
So the SingleThreadedExecutor in this situation will 1) loop through everything and execute the first message, 2) try to go back to sleep, 3) wake up immediately, and 4) loop through everything again and execute the second message.
On the other hand, whenever the EventsExecutor wakes up, it will execute all the available events before going back to sleep.

IMO the EventsExecutor approach, besides being way more performant, is also more correct, as it ensures that all events will be processed as soon as possible.
However, before making the EventsExecutor the default, it was requested that we provide an alternative implementation that ensures the same ordering behavior as the other executors.

The EventsExecutor design allows supporting different behaviors through the definition of a custom EventQueue class.
The current implementation in rclcpp just uses a std::queue under the hood, so the idea is to write a new class that internally sorts events as wait-set executors do.
This will obviously come at a performance cost (I expect performance to still be a lot better than the current default executor's), but users can always switch to different queue implementations.
P.S. There are already other queue implementations in the community:

6 Likes

Or add an argument to the existing one :wink:

1 Like

In short, I don’t have fundamental objections to working towards making the EventsExecutor the default.

However, I think at least 3 important things need to be done to it before we can even consider that:

  1. It needs to support sim_time, which it currently does not.
  2. It probably needs to have a multi-threaded version, though this may be negotiable.
  3. It needs to be run against all of the tests we currently have. Note that most tests in the core use the default (SingleThreadedExecutor) unless the test explicitly requests otherwise.

Once those 3 major things are done, I think the EventsExecutor could be promoted out of the experimental namespace, and then we could consider making it the default.

9 Likes

Why does the multi-threaded executor have such a high latency? This seems counter-intuitive to me if running on multiple CPU cores. Is there some major copy penalty?

Short answer:
The multi-threaded executor rebuilds the wait set for almost every executed entity (subscription, service, timer).

Long answer:
There is no concept of buffering ready events for every callback group, and there is only one set of ready events at a time. So what happens is:

  • A WaitSet with 2 callback groups is built and waited on.
  • The WaitSet returns a WaitResult with multiple entities that are ready for execution.
  • Thread one starts executing a ready entity and marks the first callback group as ‘in use’.
  • The second thread comes by and requests something to execute. If the WaitResult only contains ready entries for the first callback group, nothing is ready for execution (the groups are mutually exclusive).
  • The second thread builds a WaitSet containing only entities belonging to the non-blocked (second) callback group, and waits.
  • The first thread finishes executing the entity and marks the first callback group as ‘not in use’.
  • The second thread is woken up, as a callback group changed.
  • The WaitResult is empty. (The first callback group was not included.)
  • The second thread rebuilds the WaitSet containing both callback groups. ← This creates the delay
  • Wait instantly returns, as events are ready in the first callback group.
4 Likes

Neat, looking forward to a rclpy version as well. There will be one right? Right? :smiley:

Arguably it could use a performance boost on that end far more than rclcpp.

2 Likes

The performance of rclpy is a topic we have also briefly discussed in some of the last client library WG meetings.
There’s definitely multiple issues there (both problems with large messages as well as problems with high frequency small messages).
You can find a few more details here: Client Library WG May 24th Notes.

The EventsExecutor design is based on the RMW listener APIs (i.e. you can get a notification when an RMW entity receives a message) and these APIs could be exposed in Python quite easily (they are in rcl).

It would be great if someone in the community wanted to try that.

2 Likes

In the context of rclcpp executors development, is there any interest in wrapping or otherwise recreating the features of the rclc executor? The rclc executor contains many features, like user-defined ordering and triggering conditions for callbacks.

It seems like it should be possible to use the rclc executor with rclcpp by extracting all the relevant rcl handles, but I haven’t tried it personally. If anyone has and would be willing to share an example that would be helpful as well.

1 Like

I personally don’t know much about the rclc executor, but I’m interested in knowing more about it.
It could either be wrapped or re-implemented, depending on whichever is easier (side note: moving this to a common executor in rcl (not rclc) would be a very interesting project; I have no idea whether it’s doable).

I propose the following roadmap to take the EventsExecutor out of the experimental namespace:

P.S. Note that I also started a discussion dedicated to how the experimental namespace should be managed here: On the `rclcpp::experimental` namespace.

A further promotion to default executor can be discussed as a following step.

@clalancette I think these 2 should be requirements to become the default, not to go out of experimental.

3 Likes

The StaticSingleThreadedExecutor has been officially deprecated!

This change has been applied to the rolling branch and will be part of the next ROS 2 release (K-turtle, coming out in ~1 year).
The following release will remove this executor from the codebase.

If you are currently using this executor, we strongly encourage you to move to the SingleThreadedExecutor.

3 Likes

I’ve been working on a multi-threaded version of the events executor: GitHub - nightduck/rclcpp at mt_events_exec

In addition to adding a pool of worker threads, I removed the thread from the Timers Manager. Some work I’ve been doing has shown that processing the timers and subscriptions in different threads messes up the analysis you can apply to your applications. It’s better to have a single “delegator” thread that assigns work to various worker threads. The timers manager is now treated as a collection that gets polled for available timers.

I have more advanced features coming that are mostly of academic interest. But I thought that at this stage, the developer community would like to know this work exists.

Let me know your thoughts

1 Like

I am working on something similar. I also have a talk about it scheduled for ROSCon 2024.
I’ll clean up the repo and will post it here in a few days.

I had a short look at your implementation. Either I am overlooking something, or you are completely ignoring the concept of mutually exclusive callback groups.

Would you mind elaborating on this topic?

@nightduck that’s great work, thank you for your contribution!

I noticed that the new timers approach does not respect the ordering of the events, i.e. all ready timers are inserted at the end of the queue, even if maybe they triggered before some of the other events.

Do you have some performance analysis with the new executor?

Yes, that is currently a TODO item. In the applications my group is focused on, it’s assumed that all tasks are reentrant, so I’ve been putting off mutual-exclusion support. I’ll make a GitHub issue to remind myself to add that.

Right now, timers are only added after each loop of the delegator thread, so they naturally get placed at the end of the queue.
To remedy this, this work will later be merged with my work on priority event queues (we discussed that at ROSCon last year, and then a couple of months ago on GitHub). Release time can be used as the default sorting value.