ROS2 Middleware Change Proposal

Hello All,

As I mentioned in the last middleware working group meeting, the iRobot team would like to propose a change to how ROS2 handles incoming events.

We believe that, rather than using user-level wait sets, an executor event queue design will allow events to propagate faster. While the executor thread is waiting for events to arrive, it simply blocks, allowing the CPU to perform other work. When awakened, it processes the events in the order they were received. Each event contains a type enumeration and a unique handle used to process the event.

IMPLEMENTATION DETAILS :

Use a push interface with a queue to signal events to the executor.

This allows the custom middleware event handlers to notify the executor immediately via a callback pointer. The provided event type and handle allow access to the resource without searching. This callback datatype is (currently) part of rcutils and does not require the executor data types to pollute the middleware implementation. This design removes the overhead of adding, polling, and checking the wait sets for subscriptions, clients, and services when checking for new work.

Slower list maintenance is performed only when a resource is created or destroyed.

Currently there is a single queue per executor; events are presented to the user in received order, with no implied priority between event types.
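
To make the push interface concrete, here is a minimal sketch of what the event and callback types could look like. The names and fields are hypothetical, not the actual rcutils/rclcpp definitions:

```cpp
// Hypothetical event and callback types for the push interface (illustration only).
enum class ExecutorEventType
{
  SUBSCRIPTION_EVENT,
  SERVICE_EVENT,
  CLIENT_EVENT,
  TIMER_EVENT,
  WAITABLE_EVENT
};

struct ExecutorEvent
{
  ExecutorEventType type;   // which kind of entity produced the event
  const void * entity;      // unique handle; lets the executor reach the entity without searching
};

// Callback handed to the middleware when the entity is created. The middleware
// event handler invokes it as soon as new work arrives, pushing an event into
// the executor's queue.
using EventsPushCallback = void (*)(void * executor_context, ExecutorEvent event);
```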

Executor blocks waiting on Queued Events

In this design the executor blocks waiting on a queue event. When signaled, the events are removed and processed. This works with spin(), spin_some(), or spin_once(): the executor can block, process the entire queue, or process only some of the items as needed. This blocking method allows the thread to wait without consuming CPU until user work is ready to be executed.
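
As an illustration of the blocking behaviour, a simplified spin() over such a queue could look like the sketch below, built on standard C++ primitives and the hypothetical ExecutorEvent type from the previous sketch (execute_event is a placeholder for the dispatch step, not the real implementation):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

std::mutex queue_mutex;
std::condition_variable queue_cv;
std::queue<ExecutorEvent> event_queue;   // filled by the middleware push callback

void execute_event(const ExecutorEvent & event);  // dispatch using the entity handle

void spin()
{
  while (true) {  // in practice: while the context/node is still valid
    std::unique_lock<std::mutex> lock(queue_mutex);
    // Block without consuming CPU until at least one event has been pushed.
    queue_cv.wait(lock, [] {return !event_queue.empty();});
    std::queue<ExecutorEvent> local;
    std::swap(local, event_queue);   // take the whole batch and release the lock quickly
    lock.unlock();
    while (!local.empty()) {
      execute_event(local.front());  // process events in received order
      local.pop();
    }
  }
}
```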

Offload timers to their own thread

The current design requires timer maintenance to be performed during the rcl wait operations. In this design we propose to offload the timer operations to a dedicated thread and signal the executor of timer expiry using the event queue. An alternative interface could also execute the callback directly if the user specifies it as such; however, thread-safety practices would then be the user’s responsibility.

Another advantage of timer offload is that the underlying operating system timer facilities can be used to better manage the pool. Since these operate closer to the operating system scheduler, more accurate results can be obtained, especially for longer-duration intervals.
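
A rough sketch of a dedicated timers thread follows. TimerRecord and push_event are hypothetical names (the ExecutorEvent types are from the sketch above); a real timers manager would also handle timer creation, cancellation, and re-ordering of deadlines:

```cpp
#include <algorithm>
#include <chrono>
#include <thread>
#include <vector>

struct TimerRecord
{
  std::chrono::steady_clock::time_point deadline;  // next expiry
  std::chrono::nanoseconds period;                 // timer period
  const void * handle;                             // handle the executor uses to find the timer
};

void push_event(ExecutorEvent event);  // pushes into the executor's event queue

void timers_thread(std::vector<TimerRecord> & timers)
{
  while (!timers.empty()) {
    // Pick the timer with the earliest deadline and sleep until it expires,
    // letting the operating system scheduler do the waiting.
    auto next = std::min_element(
      timers.begin(), timers.end(),
      [](const TimerRecord & a, const TimerRecord & b) {return a.deadline < b.deadline;});
    std::this_thread::sleep_until(next->deadline);
    push_event({ExecutorEventType::TIMER_EVENT, next->handle});  // notify the executor
    next->deadline += next->period;                              // rearm the periodic timer
  }
}
```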

DESCRIPTION OF TESTS :

We implemented a quick proof of concept for this design where we compared:

  • Default ROS2 master
  • ROS2 with new event queue method
  • FastDDS without ROS2

Tests were conducted on a RPi 1 (single core, ~700 MHz) using our performance framework and the Sierra-Nevada topology.
The implementations use the static single-threaded executor, running all nodes in a single process.
Intra-process communication (rclcpp IPC) has been disabled to force all messages through the full DDS interface.
We use eProsima’s FastDDS and rmw_fastrtps_cpp as the middleware in these tests.

There are 3 test results shown :

  • “Default ROS2” - This is the latest master branch of ROS2 without any changes.

  • “ROS2 w/Event Queue” - This is default ROS2, with a modification to use an event queue for subscriptions, clients, and services (although our performance framework does not exercise the client or service facilities). As described above, the executor blocks waiting on an event, with subscription callback assignment performed at create time. The timer offload is not implemented for these measurements, so that we can gauge the impact of each feature change. Our performance framework still uses ROS2 timers and should show load from timer maintenance.

  • “FastDDS w/No ROS2” - This is an implementation of our test framework without ROS2, using eProsima’s FastDDS. This test allows us to see what the load and latency are without the ROS2 overhead. The message sizes and production rates are exactly the same as in the Sierra-Nevada topology used for the other tests. Processing is performed immediately when the messages arrive.

PROOF OF CONCEPTS :

[Image: CPU load and message latency results for the three configurations]

CPU Load for the default ROS2 is consistent with our previous experience.
With event queue processing for subscriptions, the load is reduced by 25%.
We believe that offloading the timers to a separate timer manager thread will reduce the load even more.

Message latency is measured as the interval between when a message is created and when it is received. Default ROS2 has a baseline of just over 1 ms. With the new event queue, the latency drops to around 400 us (a reduction of between 2 and 4 times). The direct FastDDS implementation shows that the ROS2 overhead accounts for just under 300 us per event.

Raw Data :

CONCLUSION :

We understand that this is a very different design than what is currently implemented, but we believe that this will improve the flow of events through the stack so that the CPU bandwidth can be focused on user-facing work.

What we have done here is the result of a simple prototype to prove out an architecturally sound event propagation method with as few modifications as possible. If this is acceptable to the group, we can implement a more formal approach for detailed review.

As always, we look forward to everyone’s thoughts,

Thank You,
-Lenny

Are you guys interested in discussing this as the main item in tomorrow morning’s middleware working group meeting?

Sorry, I know it’s late notice, but I didn’t see this until after dinner. If not, we can do it at another meeting.

We can discuss it today. I don’t have a formal presentation beyond what is posted here though.

-Lenny

That’s OK; if you can just introduce and summarize it, then we can discuss. Thanks, it’s on the agenda.

I think this is an interesting proposal that would provide more isolation from how specific middlewares do things.

The current approach of “leave it to the middleware” is not wrong, but it does mean that we have a focus on what capabilities DDS provides. This can lead to bias in the API. It also means that some middlewares may be more difficult to create an RMW implementation for, as you might have to provide additional functionality that the middleware itself does not offer.

On the other hand, being closer to the middleware is good for low-resource environments and systems where minimal latency and jitter are required.

I think I’d like to see some of the concepts in this proposal be implemented, but I’d like to see it happen in a way that also allows a system integrator to use alternative approaches, such as middleware-provided waitsets, when that’s appropriate for them.

Hi! In the last meeting of the middleware working group, we discussed how the new approach speeds up the execution of events by having a dedicated thread waiting for events to arrive in the queue, and by making the search for the right executable faster.

The benefits are not limited to those (fast search and execute), so to give a better appreciation I wanted to add more information about the current maintenance of the wait set (which is not needed in the new proposal).

The current approach relies on setting to NULL the pointers of entities that have no work to do (done for both the rmw and rcl wait sets). This means that after each execution, the wait set needs to be rebuilt from scratch to start a fresh run with all valid pointers.

More specifically, this is what has to be done for each run (in our case, at high frequency):

Analysis of the most performant executor, the StaticSingleThreadedExecutor:

Clear wait set
 1. Check arguments for NULL
 2. memset to zero and zero-init indexes of the rcl waitset: subscriptions, guard conditions, clients, services, events, timers
 3. memset to zero and zero-init indexes of the rmw waitset: subscriptions, guard conditions, clients, services, events

Add each entity to the rcl wait set
  1. Check arguments for NULL
  2. Init rcl wait set indexes: subscriptions, guard conditions, clients, services, events, timers.
  3. Assign the rcl waitset pointers: subscriptions, guard conditions, clients, services, events, timers

Add each entity to the rmw wait set
  1. Check arguments for NULL
  2. Init rmw wait set indexes: subscriptions, guard conditions, clients, services, events
  3. Assign the rmw waitset pointers: subscriptions, guard conditions, clients, services, events

rcl wait
  1. Check waitset is valid
  2. Check if wait set elements are not empty: subscriptions, guard conditions, clients, services, events, timers
  3. Manage timers (won't discuss that here, but has a big overhead)

rmw wait
  1. Check waitset is valid
  2. Assign waitset condition variable and mutex. Check if they are valid.
  3. Attach condition variable and mutex to:  subscriptions, guard conditions, clients, services, events
  4. Check wait set entities for new data: subscriptions, guard_conditions, services, clients, events (until it finds the first entity with data)
  5. Detach condition variable and mutex, and set to NULL the rmw pointers with no data: subscriptions, guard_conditions, services, clients, events

rcl wait (after rmw wait returns)
  1. Check timers
  2. Set to NULL the rcl pointers with no data, based on rmw NULL pointers: subscriptions, guard_conditions, services, clients, events

execute executables (rclcpp)
  1. Get number of entities: subscriptions, timers, services, clients, waitables
  2. Iterate through entities to find those that are not NULL: subscriptions, timers, services, clients, waitables
  3. Execute the non-NULL entities.
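
To summarize the above as a single loop, every executor iteration today roughly amounts to the following (illustrative pseudocode only, not the actual rcl/rmw code):

```cpp
// Rough shape of one wait-set-based executor iteration.
while (ok()) {
  clear_wait_set();                         // memset + zero-init indexes (rcl and rmw)
  for (auto & entity : all_entities) {
    add_to_wait_set(entity);                // re-assign every pointer, every iteration
  }
  wait_on_wait_set(timeout);                // check timers, attach/detach the condition
                                            // variable, NULL-out entities with no data
  for (auto & entity : all_entities) {
    if (entity != nullptr) {                // linear scan for the entities left non-NULL
      execute(entity);
    }
  }
}
```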

The results we are showing in the graphs above come from removing only subscriptions, clients, and services from the wait set maintenance. The rest of the entities (timers, events, waitables, guard conditions) still have to be cleared and re-assigned on every executor run. We are working to handle these entities with the event queue as well.
Sorry if this message got too long, but it could be useful for someone not familiar with the current executor approach. Thanks for your time considering this proposal!

@Codematic thanks Lenny for the overview in the RT working group today. The rclc executor developed in micro-ROS, which is also available as the ROS 2 package ros2/rclc, is a thin executor in C based on the rcl layer (using wait-sets) but with deterministic semantics. It is intended for microcontroller use-cases, but as it is based on rcl it can be used for ROS 2 applications also. Besides that, we have made some PRs to make the DDS timestamp of each message (of subscriptions) available in ROS 2 Foxy:

The goal is to allow deterministic scheduling in the ROS 2 Executor, for example FIFO as was the case in ROS 1. I am interested in an efficient implementation of such an executor that is also real-time capable and deterministic, with reproducibility of test results and real-time requirements being the most important reasons.

Thanks for starting this discussion, it was also on my post-vacation list

The rebuilding of the wait set with each rmw_wait() call that @mauropasse mentions is an aspect that I also felt could be avoided. It is good that this is already addressed in this proposal.

Yes @gbiggs, the rmw API is very DDS-like. We also realized that when starting to work on the non-DDS rmw_iceoryx. As ROS2 had primarily DDS middleware in mind, this is absolutely understandable.
People sometimes bypass ROS layers to be able to benefit from DDS features that are not available with ROS APIs. I also see a challenge there. By extending the rmw API with DDS features, this bypassing can be reduced and the DDS stack flexibility can be increased. On the other hand, this increases the bias towards DDS and makes it harder to write a non-DDS rmw.

Coming to the proposal.
Cool aspects are that there is much more flexibility on the middleware side and that the events are already sorted according to the order in which they were received.

I guess the wait set then disappears at the rmw/rcl API layer. Which events shall be pushed into the queue is determined by which elements are created and destroyed, right? I.e., if a subscription is created, we expect that every incoming sample results in an event in the executor queue.

With the current concept, the middleware does not have to create additional threads. rmw_wait() is called by an executor thread and does all the work. With this new proposal and the waitset-centric DDS API, I would assume that in many rmw implementations there would now be an additional thread where a waitset is maintained and updated with every create and destroy of subscriptions, services, etc. Or what is your feeling?

Regarding the timer thread: shall this also replace the timeout that is currently used in rmw_wait()? If so, this would be completely decoupled from the events coming from subscriptions, services, etc. In rmw_wait() this can currently all be handled together, and when something happens, like an incoming sample, the timer can be reset. If this is separated, you can get a timer event which is somehow obsolete because another event, like an incoming sample, came in while the timer was running. Or am I wrong here?

I like the proposal in general, but we need to be clear if we’re talking about an event queue or a work queue. I think iRobot is suggesting an event queue, but I believe that a work queue will be necessary.

Terminology: An event queue contains state changes. i.e., when a topic goes from having no messages to having received a new message, it would contain a “new message for topic x” event. In contrast, a work queue would contain the message itself with a link to the relevant topics or callbacks (representing the work to be done).
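
As a rough sketch of the distinction (hypothetical types, just to make the two terms concrete):

```cpp
#include <functional>

// Event queue entry: only a notification. The message itself stays in the
// middleware's cache until the executor takes it, so it may have changed by then.
struct Event
{
  int entity_type;       // e.g. "new message for topic x"
  const void * entity;   // handle used to take the data later
};

// Work queue entry: the message has already been taken and bound to its callback,
// so it cannot be dropped from the middleware cache before it is executed.
struct WorkItem
{
  std::function<void()> work;   // callback with the taken message captured
};
```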

An event queue has a problem that I encountered also when I added the timestamp support: By the time you get to process the event, the queue might have changed, and the data you’re looking for may have been dropped already. However, at least with DDS, you don’t notice this until you’ve actually taken the data.

Not necessarily. The DDS implementations I have looked at allow you to register a listener to the communication object which gets a notification whenever there is a change.
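
For example, with eProsima Fast DDS a listener can be attached to a DataReader roughly as sketched below (based on the Fast DDS 2.x listener API; what the callback does with the notification is up to the rmw implementation):

```cpp
#include <fastdds/dds/subscriber/DataReader.hpp>
#include <fastdds/dds/subscriber/DataReaderListener.hpp>

class NotifyingListener : public eprosima::fastdds::dds::DataReaderListener
{
public:
  void on_data_available(eprosima::fastdds::dds::DataReader * reader) override
  {
    // Invoked by the middleware whenever the reader's cache changes; an rmw
    // implementation could push an event for this reader into the executor queue here.
    (void)reader;
  }
};

// The listener is passed when the DataReader is created, e.g.:
//   subscriber->create_datareader(topic, reader_qos, &listener);
```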

@Ingo_Lutkebohle: Yes, that’s a valid point. I have already used listeners for topics that return the affected reader in the callback.
Anyway, there is flexibility, as you can do the “bookkeeping” job in the rmw implementation or inside the DDS stack, whichever makes more sense.

Although I like the idea of work queues in general, how does a work queue avoid this out-of-dateness problem? Or conversely, why is it not a problem? Although, unlike an event queue, it would have the work to do (the message data) available even when it’s out of date, you still have to spend time determining that the data is out of date. With an event queue you notice the data is no longer available and move on; with a work queue you notice the chunk of work is out of date and move on.

That depends on what you mean by “out-of-dateness problem”.

The problem I was looking to solve, and where work queues differ from event queues, is that you can still get the data even though it may already be superseded. That’s because the work queue keeps it until the executor removes it.

A different problem is that you may have set the subscriber queue to a small size because you do not actually want to process something that is superseded. This is still possible with a work queue, because you can inspect the entire queue.

However, I would argue that in general one should strive to avoid such situations, because they are difficult to make deterministic, and thus are exactly the kind of hard-to-find bug that can cause sleepless nights…

OK, that’s pretty much what I was expecting. I agree that it sounds more robust and flexible than an event queue.

Originally, I wanted a work queue for this, because it moves the data out of the communication library and presents it to the executor in the order it was received. This is a common design in telco gear, and works well there. However, with DDS as the communication framework, we were concerned about losing the history cache and its associated functionality. When looking closer we realized that using an event queue was closer to what the current infrastructure implements, so it seemed easier to move the current code in that direction.

We believe that having an event queue is also more flexible, and allows the addition of other message event types that could support a work-queue-style event. Because the event type and data aren’t processed by the ROS2 middleware layers, you can pipe through any kind of event you want. In my mind I can see this being used for a ZeroMQ, direct IPC, or bus interface if other advanced communication features aren’t needed.

Thanks for the input !
-Lenny Story

… events are already sorted according to the order in which they were received.

Agreed, event ordering is a key feature we need for our use case.

I guess the waitset then disappears on rmw/rcl API layer.

We think so; our intention is not to poll, or have to update these constantly. Our goal is to reserve as much CPU as possible for processing events rather than for the infrastructure.

Which events shall be pushed in the queue is given by which elements are created and destroyed, right? I.e. if a subscription is created, we expect that every incoming sample results in an event for the executor queue.

Yes, this is another feature we need. The events are delivered as they occur, so that the executor sees them with the same timing and order as the real world (within system constraints, of course).

Regarding the timer thread. Shall this also replace the timeout that is currently used in rmw_wait()? If so this would be completely decoupled from the events coming from subscriptions , services etc. In rmw_wait() this currently can all be handled together. And when something happens, like an incoming sample, the timer can be reset. If this is separated you can get a timer event which is somehow obsolete because another event like an incoming sample came in while the timer was running. Or am I wrong here?

The implementation does have a wait, but it’s only for waiting on the event queue. The rcl layer is just a pass-through. The decoupling is intentional; I think that having the events arrive in the order they were received is important, as is their timing. For messages, my experience is that batching can add more load than just processing what you get when you get it.

Thanks !
-Lenny Story

The DDS implementations I have looked at allow you to register a listener to the communication object which gets a notification whenever there is a change.

If rmw/rcl had support for registering a listener, would it be possible to implement a work/event queue “out of tree” (i.e. in a different repo)?

I think it’s much easier to get agreement about adding support for listeners than about modifying how executors work.
With that done, people will be able to experiment with their custom executors without needing to keep patches in a fork. That experience can later be used to decide how executors should work.


I don’t think that a wait set based approach necessarily has bad performance; I think that the bad performance comes from a bunch of implementation details. E.g., we need to add conditions to the wait set again in each iteration, because we use the wait set itself to indicate which conditions were active. If we had an output argument to indicate the active conditions, we wouldn’t need to rebuild the wait set in each iteration (I estimate this is the biggest source of bad performance in the current approach).
It would be great in the future to profile the code and improve the performance of the wait set based approach; I think performance can be greatly improved without taking an architecturally different approach (though this is only speculative, as I haven’t done any profiling of the code).
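
For illustration, a hypothetical wait call with an output argument for the active entities could look like the sketch below (not the real rmw API; it just shows how the per-iteration rebuild could be avoided):

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Which entries of a persistent wait set became ready during the last wait.
struct ActiveEntities
{
  std::vector<std::size_t> ready_subscriptions;  // indices into the persistent wait set
  std::vector<std::size_t> ready_clients;
  std::vector<std::size_t> ready_services;
};

struct PersistentWaitSet;  // built once, updated only on entity create/destroy

// Blocks until something is ready or the timeout expires, reporting the ready
// entities through the output argument instead of NULL-ing out the wait set,
// so nothing has to be rebuilt on the next iteration.
bool wait_for_active(
  PersistentWaitSet & wait_set,
  std::chrono::nanoseconds timeout,
  ActiveEntities & active);
```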

Best,
Ivan

When we did the design on the timestamp information changes for Foxy, I originally had the same line of thought. However, after the design discussion in Improvements to rmw for deterministic execution · Issue #259 · ros2/design · GitHub, I came to the conclusion that you cannot have deterministic, in-order processing while keeping data in DDS buffers.

The full argument is at Improvements to rmw for deterministic execution · Issue #259 · ros2/design · GitHub but since that comment is rather long, let me summarize:

Two options have been suggested for processing data in temporal order:

  1. Obtain timestamps for all currently available items, sort the data by timestamp but leave it in the DDS queue, then take the data one by one during processing. Now, by the time you actually take the data, it may have been replaced by newer data. You can then either process this newer data item, which destroys time ordering, or you can defer processing. If you defer processing, starvation may occur if this happens again and again.
  2. Take all available data, sort it by timestamp, then process it. This does guarantee temporal ordering in all situations, at the price of sometimes using older data than is available.

It follows that option 1 either runs the risk of starvation or destroys temporal ordering, and option 2 cannot guarantee using the newest data. These two goals are incompatible.

Neither of these options is thus ideal, but arguably, a situation where you can’t keep up with the incoming data rate must be considered exceptional (if it isn’t, you can apply throttling in a deterministic way, e.g., by a take-nth filter). It is important that in exceptional situations, the system still behaves deterministically. Option 2 is deterministic, option 1 isn’t.

Since an event queue translates to option 1, and a work queue translates to option 2, I argue that we need a work queue.
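
For illustration, option 2 (the work-queue style) could be sketched like this, with hypothetical types; the timestamp would come from the message info provided by the middleware:

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// Everything currently available has already been taken from the middleware and
// bound to its callback, together with its timestamp.
struct WorkItem
{
  std::uint64_t stamp_ns;          // source or reception timestamp, in nanoseconds
  std::function<void()> execute;   // callback with the taken message captured
};

void process_in_temporal_order(std::vector<WorkItem> & batch)
{
  // Sort by timestamp so callbacks run in temporal order, even if some of the
  // data has meanwhile been superseded by newer samples.
  std::stable_sort(
    batch.begin(), batch.end(),
    [](const WorkItem & a, const WorkItem & b) {return a.stamp_ns < b.stamp_ns;});
  for (auto & item : batch) {
    item.execute();
  }
  batch.clear();
}
```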

Agreed.

I generally agree, but with some comments: traditionally, wait sets were used to bundle work together, which optimizes for throughput. Listeners were used when latency is important, optimizing for responsiveness.

However, in recent years, queue-based approaches have become popular because they combine positive aspects of both and can support higher data rates. For example, the Linux io_uring APIs.

Hello All,

We’ve finally completed the work on this proposal.

We believe this has greatly improved performance, in terms of both system load and message latency.

The full performance report is here :

executors benchmark.pdf (1.0 MB)

The following are the applicable PRs:

rclcpp PR - https://github.com/ros2/rclcpp/pull/1416
rcutils PR - https://github.com/ros2/rcutils/pull/303
rcl PR - https://github.com/ros2/rcl/pull/839
rmw PR - https://github.com/ros2/rmw/pull/286
rmw_implementation PR - https://github.com/ros2/rmw_implementation/pull/161
rmw_fastrtps PR - https://github.com/ros2/rmw_fastrtps/pull/468
rmw_cyclonedds PR - https://github.com/ros2/rmw_cyclonedds/pull/256

There is also a design PR :

design PR - https://github.com/ros2/design/pull/305

We ran, and passed, all the current test cases against this new executor, and created a few new tests as well.

We look forward to hearing your thoughts,

Lenny Story
Mauro Passerino
Alberto Soragna
