@ivanpauno I would like to upload the flamegraphs and ftraces mentioned in the above report, but Discourse does not let me.
The flamegraphs are SVGs, but Discourse displays them as raw images, which makes them useless; the ftraces are .dat files, which Discourse doesn’t accept at all.
Since the plan is to organize a breakout session for this, we can definitely show them there and find a way to distribute them to the participants.
@Codematic Sorry, I haven’t gotten to that yet. I made a note for tomorrow morning’s meeting to survey the people in attendance to see what time/date would work for them, hopefully in the next few days.
I still think having a spin-off meeting is the right thing to do so we can dig into the technical details with a smaller group, but the agenda for tomorrow is light, so we could discuss some things there as well.
Based on the feedback from the meeting and my schedule, I’d propose a meeting at 9 am PST on Monday November 9th. Does anyone have a conflict with that time/date?
I’ve added a calendar event to the ROS 2 upcoming events calendar:
Following up from our special Middleware Working Group meeting on this topic, we didn’t come to any concrete conclusions or action items, but there seems to be a lot of support for separating the Timers from wait sets, using callbacks via the rmw API rather than wait sets, and using an event queue to integrate the two.
We’re going to have another meeting soon, I’d like to propose we meet again at the same time next week, so Monday November 16th at 9 am PST. I’ve added it tentatively to the calendar. Please let me know if any of you have any conflicts.
Here are some notes that @carlossv took at the meeting:
Problems with the current waitset implementation:
Waitset performance
The waitset has to be re-created every time
You have to iterate over all communication objects
Timer performance
Timer handling is not efficient. It can be improved.
The current approach makes it easier to combine timers with message handling: no need to worry about concurrency issues or message-processing/timer ordering.
Message processing ordering
Problems with the event queue proposal:
The queue is unbounded.
How to deal with overload situations, or with events that become out of date?
How to handle timers? Is it okay to handle them in a separate user thread?
To solve the unbounded queue issue, Ingo proposes (see the sketch after this list):
Use a ready-set: a list or fixed-size queue/array of communication object handles.
Set a flag to signal the data is ready
Take all the data and process in order
Use a dirty flag to prevent adding events that are already added
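For illustration, here is a minimal sketch of that ready-set idea, assuming a fixed capacity known at attach time; all names are made up for this example, none of this is an existing rmw API:

```cpp
// Minimal sketch of the bounded ready-set idea above; illustrative only.
#include <array>
#include <atomic>
#include <cstddef>

constexpr std::size_t kMaxEntities = 64;  // fixed capacity, known at attach time

struct ReadySet {
  // One "dirty" flag per attached communication object. Setting an
  // already-set flag is a no-op, so an entity is never added twice.
  std::array<std::atomic<bool>, kMaxEntities> ready{};

  // Called from the middleware when data is ready for entity i.
  void mark_ready(std::size_t i) {
    ready[i].store(true, std::memory_order_release);
  }

  // Called by the executor: take everything that is ready and process it,
  // clearing the flags so new events can be recorded again. Note this simple
  // version processes in attach order, not in temporal order across entities.
  template <typename Fn>
  void take_all(Fn && process) {
    for (std::size_t i = 0; i < kMaxEntities; ++i) {
      if (ready[i].exchange(false, std::memory_order_acq_rel)) {
        process(i);  // take all available samples for entity i and handle them
      }
    }
  }
};
```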
The problem with taking everything is that it could lead to counter-intuitive situations where the data handling is overloaded and data is skipped/dropped. This could be acceptable only if the overload is expected to be temporary.
(Note from William): IMO, the counter-intuitiveness of the behavior is not due to dropping or skipping messages, but due to the fact that messages taken in a batch may result in processing old data, while newer data is available, for significant periods of time when the history depth is more than 1.
On the other hand, taking just one message at a time is a bad idea.
(Note from William): I don’t think it’s necessarily a bad idea, it’s just got performance problems due to querying the wait set so frequently (after each event). It also contributes to non-determinism when dealing with multiple subscriptions and/or timers due to scheduling race conditions.
Another option to solve the unbounded queue issues is to set a maximum queue size.
Open questions: what to do when the queue is full? Prune the queue?
How DDS users typically handle timers (see the sketch after this list):
Create a user thread to handle the timers
Timers set guard conditions
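For reference, that guard-condition pattern looks roughly like this, using std primitives as stand-ins for a DDS GuardCondition and WaitSet (an illustrative sketch, not any particular DDS API):

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

struct GuardCondition {               // stand-in for a DDS GuardCondition
  std::mutex mutex;
  std::condition_variable cv;
  bool triggered = false;

  void set_trigger_value(bool value) {
    {
      std::lock_guard<std::mutex> lock(mutex);
      triggered = value;
    }
    if (value) {
      cv.notify_all();                // wake up the "waitset" side
    }
  }

  void wait() {                       // stand-in for WaitSet::wait()
    std::unique_lock<std::mutex> lock(mutex);
    cv.wait(lock, [this] { return triggered; });
  }
};

int main() {
  GuardCondition timer_ready;

  // User timer thread: sleep until the timer period elapses, then trigger.
  std::thread timer_thread([&timer_ready] {
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    timer_ready.set_trigger_value(true);
  });

  timer_ready.wait();                    // blocks until the timer fires
  timer_ready.set_trigger_value(false);  // reset before waiting again
  timer_thread.join();
}
```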
In the next meeting, some specific problems will be discussed so they can start making some decisions about how to solve them.
Also, something that came up in the meeting is the idea that integrating execution of Timers and other events (like new messages available on a Subscription) using an event queue is important because that’s how it’s done in ROS 1, allowing users to assume (whether they are aware of it or not) that callbacks are not called concurrently, even with timers.
I was wondering if that was true, so I did some digging, and it looks like in roscpp that is the case, but in rospy it is implemented by giving each timer a separate thread:
In roscpp there is a thread that monitors what timers are ready, and pushes “events” into the callback queue:
Thread being created:
Thread function adding callbacks to the callback queue:
Then the timer is called from the callback queue’s execution, so it resembles the proposed design quite a bit. The callback queue is also unbounded AFAICT, as it uses a std::deque and doesn’t prune it anywhere that I can see:
Note that while this is interesting, the approach and behavior of ROS 1 was not necessarily ideal with respect to performance and determinism, so it may not reflect the best solution for ROS 2.
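For those who haven’t read the roscpp source, the pattern described above boils down to something like this (an illustrative sketch, not the actual roscpp code):

```cpp
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <iostream>
#include <mutex>
#include <thread>

// Unbounded callback queue, analogous to roscpp's (it also uses a std::deque).
class CallbackQueue {
public:
  void push(std::function<void()> cb) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push_back(std::move(cb));
    }
    cv_.notify_one();
  }
  // Block until a callback is available, then pop and run it.
  void call_one() {
    std::function<void()> cb;
    {
      std::unique_lock<std::mutex> lock(mutex_);
      cv_.wait(lock, [this] { return !queue_.empty(); });
      cb = std::move(queue_.front());
      queue_.pop_front();
    }
    cb();
  }
private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> queue_;  // unbounded, never pruned
};

int main() {
  CallbackQueue queue;

  // Monitor thread: waits until a timer is ready, then pushes an "event".
  std::thread timer_monitor([&queue] {
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    queue.push([] { std::cout << "timer callback\n"; });
  });

  // The timer callback runs from the queue's execution, not from the monitor
  // thread, so it is serialized with message callbacks pushed the same way.
  queue.call_one();
  timer_monitor.join();
}
```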
Thanks for the summary @wjwwood @carlossv. Here are some thoughts and questions from my side.
Besides the better handling of timers with this proposal, the benefits of the event queue are the naturally given temporal order of the events in the queue, and that there is no need to iterate over all entities to check for work. The reason for not having a work queue is that we would lose the history cache of DDS; this functionality would have to be re-implemented in higher layers when needed. The history cache is preserved with the event queue, but it is also the source of workarounds.
With the history queue and the event queue there are now two queues that are accessed at different times and that behave differently. When the history queue in DDS overflows and drops samples, the number of available samples in DDS and the number of events in the event queue diverge. This destroys the temporal order and leads to “empty” takes from rmw. E.g. if you are spinning at 10 Hz and want to process the latest and greatest velocity that arrives at 100 Hz, the history QoS can be set to 1. This would result in around 9 empty takes per spin cycle, and also 9 unnecessary pushes to the event queue, with expensive mutex operations and possibly context switches. Maybe there are better ways to solve this use case?
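To make the numbers concrete, here is a toy calculation of that 100 Hz / 10 Hz / KEEP_LAST(1) scenario:

```cpp
#include <iostream>

int main() {
  const int publish_hz = 100;     // velocity topic rate
  const int spin_hz = 10;         // executor spin rate
  const int history_depth = 1;    // KEEP_LAST(1) history QoS

  const int arrived_per_cycle = publish_hz / spin_hz;  // 10 samples per spin
  // Every arriving sample pushes one event, but only `history_depth` samples
  // survive in the DDS history cache until the executor spins.
  const int nonempty_takes = history_depth;            // 1 per cycle
  const int empty_takes = arrived_per_cycle - nonempty_takes;

  std::cout << empty_takes << " empty takes per spin cycle, "
            << empty_takes * spin_hz << " per second\n";  // 9 and 90
}
```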
We discussed the idea of having an ID that identifies the specific sample connected with an event. Then, with a take, you tell the rmw layer which sample you want, and the temporal order can be preserved. This is dangerous: with worst-case timing, all the samples you want to take are no longer available and you will get no samples at all. Another option to preserve the temporal order would be to first take everything from rmw and sort it before executing the callbacks.
The “empty” events can be avoided with the discussed dirty flag. But I have the feeling that we would start building something similar to a waitset just to save the one or two iterations over the entities that would remain with an optimized waitset approach.
One use case I, and maybe others, see is to have one or a few triggers that shall wake up the spin, with the majority of the entities then processed according to what is available when the trigger comes. E.g. an image topic is the trigger, and from the odometry topic the latest and greatest shall be processed once the image arrives. I feel that the waitset approach could be extended to support this. You can argue that this is also something that can be put on top of the event queue, but the consequence would be even more unnecessary events and context switches.
The proposal seems to bring a great performance boost compared to the current waitset implementation. Since the proposal is on the table, this could lead to an immediate improvement. But I also feel that there are different goals, like responsiveness and determinism, that are hard to combine in one approach.
Would it make sense to have an event-queue executor that focuses on performance and does not do anything further to really guarantee temporal order etc., and in the future another, optimized waitset-based implementation? The latter would focus more on determinism, with guaranteed temporal order or user-defined order, and could also be used in scenarios where only some of the entities are triggering, as I describe above. Maybe it could even be combined with polling subscriptions?
@Ingo_Lutkebohle can you clarify how a ready-set is different from a waitset?
I get the point that if you use DDS waitsets there’s an “impedance mismatch” with ROS 2 waitsets, and that hurts performance.
But if the API of a ready-set is going to be the following:
attach methods to add entities.
detach methods to remove entities.
wait method. This can return only the “ready handles”, so you don’t have to iterate over all of them.
The “ready handles” can be pre-allocated and bounded (as you know how many entities were attached).
It sounds like it’s conceptually exactly the same as a waitset, but avoiding the performance issues of using DDS waitsets behind the scenes.
Is my understanding of that proposal right?
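Summarized as an interface, my understanding of the proposal is something like the following (illustrative names only, not an actual rmw API):

```cpp
// Hypothetical ready-set interface matching the bullets above.
#include <vector>

using EntityHandle = const void *;  // opaque handle to a communication object

class ReadySetInterface {
public:
  virtual ~ReadySetInterface() = default;
  virtual void attach(EntityHandle entity) = 0;  // start watching an entity
  virtual void detach(EntityHandle entity) = 0;  // stop watching it
  // Blocks until at least one attached entity has work, then returns only the
  // ready handles. The result buffer can be pre-allocated and bounded because
  // the number of attached entities is known. (A timeout parameter would
  // presumably be added in practice.)
  virtual const std::vector<EntityHandle> & wait() = 0;
};
```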
If that’s the case, I think that having first-class support for listeners in rmw, and instead implementing “waitsets” (or “ready sets”) on top of that API in an upper layer, would be a good idea.
It not only avoids the performance issues, but it also makes the rmw API simpler, avoiding the burden placed on rmw vendors that don’t have a waitset implementation (see rmw_fastrtps).
I think it’s good to differentiate the three different issues we’re discussing:
How timers are handled
Up to now, only the performance issue has been mentioned, but that’s not the only issue with the current timers approach.
Timers don’t work with simulated time and an RTF (real-time factor) > 1.
It’s much easier to handle that case from a custom timers thread.
You can use an “event” mechanism to notify that a timer is ready if you don’t want to handle the work directly in the timers thread.
Performance, the two main issues seem to be:
Rebuilding wait sets in each iteration.
Timers
Determinism of the execution
I think that not using DDS waitsets, but rather having a listener interface as proposed in https://github.com/ros2/rmw/pull/286, will allow implementing executors with better performance.
After that is added, there are a lot of different ways to handle execution with different trade-offs between performance and determinism (event queue, work queue, ready-sets, and small variants of them).
I don’t have any execution-model preference; I would just pick one (with a preference for the ones that better respect the DDS QoS contracts).
IMO, the only important thing is to add an rmw API that is flexible enough to implement any of the models on top of it, and the current proposal seems to achieve that.
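As a rough illustration of what such a listener primitive could look like (a hypothetical shape; the actual API proposed in ros2/rmw#286 may differ):

```cpp
#include <cstddef>
#include <functional>
#include <iostream>

struct Subscription {
  // Set once by the executor layer; invoked by the middleware with the number
  // of new messages, so different execution models (event queue, work queue,
  // ready-set) can all be layered on this single primitive.
  std::function<void(std::size_t /*new_messages*/)> on_new_message;
};

int main() {
  Subscription sub;

  // An event-queue executor would push an event here; a ready-set executor
  // would set the entity's dirty flag instead.
  sub.on_new_message = [](std::size_t n) {
    std::cout << n << " new message(s) available\n";
  };

  // Simulate the middleware delivering data.
  if (sub.on_new_message) {
    sub.on_new_message(1);
  }
}
```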
I think it’s reasonable for us to have our own QoS policies for the ROS parts of the stack, separate from QoS applied to the middleware layer. Having suitable executor QoS settings would give some conceptual unity to how execution and communication are configured.
I like this approach because it allows the true power of the executor model to be realised.
@ivanpauno thanks for the accurate summary. I agree with your proposal.
Some remaining questions:
should this be an rmw-level-only API, or do we expose it to the user through rcl? I’m undecided.
what kind of events to provide – only “ready state now active”, or also things like “message evicted from queue” or “n items available”? Not sure what’s available in DDS, but it would help with taking if we knew the number of messages that are available.
One thing I’d like to add is that we should really also provide a “default” executor that uses this, or adapt the existing one, to make sure the API really works.
Lastly, we might consider adding the concept of a condition, where the wait does not necessarily return once anything happens, but could call a user-supplied function to determine whether the current state is sufficient. This would reduce spurious wakeups.
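A minimal sketch of that condition idea, assuming a plain mutex/condition-variable pairing (names are illustrative):

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>

struct GuardedState {
  std::mutex mutex;
  std::condition_variable cv;
  int items_available = 0;
};

// Returns only when the user predicate is satisfied, absorbing wakeups the
// caller doesn't care about,
// e.g. pred = [](const GuardedState & s) { return s.items_available >= 5; }
void wait_until(
  GuardedState & s, const std::function<bool(const GuardedState &)> & pred)
{
  std::unique_lock<std::mutex> lock(s.mutex);
  s.cv.wait(lock, [&] { return pred(s); });  // re-checked on every wakeup
}

// Producer side: update the state under the lock, then notify.
void add_item(GuardedState & s)
{
  {
    std::lock_guard<std::mutex> lock(s.mutex);
    ++s.items_available;
  }
  s.cv.notify_all();
}
```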
Sorry, I’ve been on vacation, but I’m back and we could still have it, though I don’t know if we need to.
We could also cancel the next meeting, as we don’t have any items proposed for the agenda, other than some old business which I think we could do off-line.
I think we’re still just trying to work through the backlog of design changes requested.
Also, I actually don’t have the ability to record it, since I’m not at my desktop and we no longer have recording through Google Meet.
I’m inclined to cancel this one, due to the above, unless someone has an item for the agenda.