Fast, accurate, robust replay in ROS2

Background

Hey folks,

I’m interested in creating a framework for deterministic offline processing of bag files. There seems to have been some discussions here and there on this topic scattered throughout the years but I want to start a discussion on how specifically I could go about trying to make this happen in ROS2. The goal here is viewed through the lens of deterministic offline simulation. I’m specifically looking at making this work in ROS2, not using some other library for subgraphs that I run offline separately and integrate with ROS2 on the robot.

The Problem

ros2 bag play doesn’t “close the loop”. It just replays messages at a fixed rate. Your execution machine too slow to keep up? Too bad, you’ll start dropping messages. Your machine really powerful? Too bad, it will still playback in real time and waste compute, unless you increased the fixed playback rate multiplier in which case it will replay faster, but get a little too excited and set that a little too high, or run some other compute intensive job at the same time, and your back to dropping messages. This is particularly an issue if you are trying to simulate your robot on a different machine with significantly different compute resources (either higher or lower). Or run batch simulations in the cloud where unused CPU cycles are just wasted dollars.

Not disparaging the design of course, there are very legitimate reasons for the behavior as is, but viewed specifically from the lens of offline processing/replay simulation something more “active” is needed. More of a bag-file based simulator than a naive bag replay

Goal

I’d like to get to the point where I can input a launch file and a bag file and get a deterministic output bag file no matter what underlying hardware I run it on, and have it always use the maximal available compute resources to execute without dropping messages (so weak machines are likely slower than real-time, powerful machines would likely be faster, and I as a user don’t have to tweak parameters to make maximum use of hardware).

As a follow on goal being able to take in a model of things like models of execution times for callbacks, so a launch file ran on non-robot hardware is much more representative of on-robot performance. But we gotta walk before we can run.

Questions

From my digging the existing conversations on this topic usually ended at either “we’ve thought about it but haven’t put anything together yet” or “go try XYZ library that isn’t ROS”. I’d like to focus this discussion on specifically how to do this in ROS2. I’m looking to start implementing this myself, but after a lot of research I’m having trouble figuring out what are the available options to work with in ROS2, or if it’s even possible with ROS2 at all right now. So with that being said I’m looking for a few pointers:

  1. Can this be done in a general way at all with the design of ROS2?
    Or maybe put another way “is the necessary information available to do this without custom node types/executors/custom heartbeat or processing status messages?”. Can I, from another node (like say a bag reader node) query the execution state of another node in the graph to know when I can/can’t publish a new message? I know I can query pub/sub topics, etc. But can I ask something like “Is node X processing any callbacks?” or “When is node X’s next timer configured to fire?”. If so can someone point me to where I might be able to find that API? When I saw “graph events” I had some brief hopes, but it doesn’t look like there’s any way to get execution state information from that, just node connectivity.

  2. If #1 is “no” then can this be done in a general way at executor level?
    From perusing the executor API there it looks like the sort of callback status information I’m asking for in #1 may be available. This is a less idea solution because you’d need to translate all of your launch files into 1 composed mega node (or have a launch file for on-robot and a mega node for offline), and wouldn’t work with rclpy nodes all together. But even then it would still be very valuable compared to not being able to do anything like this at all.

  3. Any other thoughts on the general approach here, particularly in the form of “this approach is likely to end in sadness, approach Y would be better”? Although I would like to avoid sidetracking the discussion to things that don’t actually meet the requirements of maximum resource usage and guarantees of not dropping messages like “Just tune your real time factor”

1 Like

A few features of ros2 / rosbag that exist already and might be worth having a look at for your problem are:

  1. The time source used by your node(s). The proper selection of time source is very important in order to get your rosbag playback to closely match the recorded scenario on different systems and be repeatable. By default, all your nodes will use the system clock, which will conflict with your rosbag playback (or a simulator like gazebo). However, by setting use_sim_time to True for your nodes and telling rosbag2 to publish on /clock (see here for the history of this feature: play - publish /clock topic · Issue #99 · ros2/rosbag2 · GitHub), they will instead use the time published on /clock (which for a rosbag will approximate the time sequence when the bag was collected). This is similar to how a ros2 system can operate in a simulator like gazebo that runs faster or slower than realtime. This also assumes your nodes are using the node’s clock and not chrono or some other time source directly and not conflating the current time with the messages header.stamp (e.g. like when looking up a tf transform). Doing this helps to avoid the stretching or shrinking of the time series on machines that cannot keep up with or can really outrun the bag’s output.
  2. The QoS of messages being published by the rosbag. If your topics are reliable and have a healthy buffer and lifespan, ros2’s middleware will do its best to guarantee the messages from the rosbag get delivered in the right order and without being dropped. but if your topics are best effort, you will lose messages if your computer cannot keep up.
  3. Syncing the node startup time with the rosbag starting to play back is also important. If you start your rosbag at the same moment that you launch your nodes, it’s likely the nodes will miss messages depending on the system. So ideally you need a programmatic way to have your rosbag playback wait for your nodes to start up. Launch event handlers are useful for this kind of timing (but I will admit still feels hack-y).
  4. Getting /tf_static from a rosbag can be unreliable (this is not a complaint waged at TF2 or rosbag, just what i experienced rather recently). Personally, I have resorted to just republishing the tf_static messages locally (i.e. launch a separate instance of robot_state_publisher) rather than relying on the rosbag to reproduce them consistently. Since they are static, this doesnt have an impact on replay fidelity afaik.

I know that these do not fully solve your problem, but I have found them helpful in getting my own rosbag “sims” to be much more repeatable. And maybe these are relevant to whatever framework you end up arriving at. :smiley:

I know you wrote you specifically want to do this with ROS 2 and not with “library XYZ that’s not ROS”, but I believe the following presentation from ROSCon18 deserves at least to be mentioned:

Deterministic, asynchronous message driven task execution with ROS (slides, recording):

This presentation discusses the building of a task-driven, message-synchronized execution abstraction layer on top of ROS’s message passing, aiming to mimic a deterministic system while continuing to use a non real time operating system. We discuss the overall structure of the framework, as well as the details of various capture and synchronization policies, and the profiling and diagnostics of task executions. We further discuss methods for using simulators and bagfiles in conjunction with the synchronization framework for repeatable testing and issue reproduction.

Link to the framework: ZebraDevs/flow (used to be fetchrobotics/flow).

The presentation Deterministic reversible debugging of ROS nodes with Mozilla rr also seems to fit the topic (same track, same ROSCon18).

PS: @tfoote: the video(s) from ROSCon18 seem to “have not been rated”, similar to what happened in ROSCon recordings on vimeo require login ).

1 Like

At Apex.AI, we implemented (closed-source) something very similar to what you described. We gave a talk at the ROSCon describing the main ideas behind our implementation: Say goodbye to flakiness in your integration tests on Vimeo

We are not using the default ROS executor but the solution we implemented does not rely on anything specific to our executor. The central coordinator is not querying status information (only node topology). The nodes themselves inform the coordinator of their status (waiting for some messages, executing some messages…)

1 Like

I’m currently working on something very similar to what you described. It addresses exactly the problem layed out in the Apex.AI talk, and altough i had not seen this talk before i think i arrived at a very similar approach. My solution is not open source yet, but i plan on publishing it (once my masters thesis is done, which this is the main part of).

I put a small summary of the issues and approaches on my website: Deterministic ROS2 Callback Scheduling :: Jonas Otto

2 Likes