Deterministic replay and debugging

I need to build a ROS system that can log its inputs and node communication such that all calculations can be reproduced exactly during analysis later. The system needs to have two modes:

Operation

  • The system is in control of the physical robot
  • All sensor data is recorded
  • Sufficient data about the timing and data dependencies between the nodes is recorded.

Analysis

  • The system is not connected to the physical robot
  • The recorded sensor data is replayed to the system
  • The relative timings and data dependencies between the nodes are preserved. This means other nodes may have to idle if one node is paused in a debugger.

For example there may be a system with this topology A --[/x]--> B --[/y]--> C where A,B,C are nodes and x,y are topics. All three nodes are implemented with a polling loop (using ros::Rate and ros::spinOnce).

During analysis the node B is stopped in a debugger. Now A needs to pause because otherwise the topic queue x would overflow and messages would be dropped that were not dropped during operation. Same with C: It needs to pause because it is not getting the messages from B that it did get during operation.
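
To make the failure mode concrete, here is a minimal simulation in plain Python (no ROS involved). It assumes a bounded subscriber queue on /x that discards its oldest entry when full; `run_pipeline`, `queue_size` and the tick-based timing are made up for illustration and do not come from any ROS API.

```python
from collections import deque


def run_pipeline(ticks, b_paused_ticks, queue_size=3):
    """Simulate A --[/x]--> B with a bounded subscriber queue on /x.

    A publishes one message per tick. B consumes one message per tick
    unless the tick is in b_paused_ticks (B is stopped in a debugger).
    Returns (messages B processed, messages dropped from the queue).
    """
    x_queue = deque()
    processed, dropped = [], []
    for tick in range(ticks):
        # A publishes; a full queue discards its oldest entry first.
        if len(x_queue) == queue_size:
            dropped.append(x_queue.popleft())
        x_queue.append(tick)
        # B polls its queue unless it is paused.
        if tick not in b_paused_ticks and x_queue:
            processed.append(x_queue.popleft())
    return processed, dropped


# Operation: B is never paused, so nothing is dropped.
live_processed, live_dropped = run_pipeline(10, set())

# Analysis: B sits in a debugger for ticks 2..6 while A keeps publishing.
debug_processed, debug_dropped = run_pipeline(10, {2, 3, 4, 5, 6})
```

In the second run B never sees messages 2, 3 and 4, so its calculation differs from the recorded one unless A is paused as well.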

This seems like a standard problem, yet I cannot find anything in the ROS ecosystem that addresses this.

I have some ideas for developing this myself, but I’d rather use an existing solution.

rosbag appears to be unsuitable: The limitations of rosbag record/play

ROS’s pub/sub is inherently asynchronous, and one of the drawbacks of that is there is no flow control implicit in the communication. For example, if you wanted a publisher to only send data when a subscriber has space available for it, then I would call that synchronous pub/sub. The advantage of asynchronous is that you can easily support one to many publishers, which in turn makes it easier to support features like playback of many recorded data streams from a single entity (essentially rosbag). The advantage of synchronous is that you can control the flow of the data for intermediate topics, as you’ve described above.

The problem you’ve described is sort of the problem that ecto was designed to address:

It’s been around for a while, and I think people are still using it in perception-like pipelines (sort of what you’ve described above) to much success. I’m not sure how active development is on it these days, but I think it integrates quite well into ROS.

Ecto creates a separate, self-contained synchronous graph which exists in a larger ROS system as a single node, and so its internals are not introspectable with the normal ROS tools.

It’s possible, in theory, to enforce a synchronous flow control over a set of asynchronous ROS nodes using extra topics and services to control the flow of the internals of each node, but I’m not aware of anyone who has done this in a generic way. Maybe someone else can speak up if they know of one.

For ROS 2, we’ve been discussing how we might do this in order to determine if there are any features missing in ROS 1’s communication system that would prevent us from doing so, but we’ve stopped short of writing this down in a white-paper-like document or prototyping it. The basic idea is to make it possible to control the behavior of each node in the synchronous graph through a polling mechanism, and then use that polling (or pumping) mechanism to implement a supervisor who “fires” each node in sequence. Then the fact that asynchronous comms are being used between the nodes is unimportant.
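
A rough sketch of that supervisor idea, in plain Python rather than any ROS API. `PolledNode`, `fire` and `run_supervisor` are hypothetical names, and the in-memory deques merely stand in for topics; this only illustrates the "fire each node in sequence" pattern, not how ROS 2 would implement it.

```python
from collections import deque


class PolledNode:
    """Hypothetical node that only does work when the supervisor fires it."""

    def __init__(self, name, transform):
        self.name = name
        self.transform = transform
        self.inbox = deque()
        self.outputs = []  # inboxes of downstream nodes

    def connect(self, other):
        self.outputs.append(other.inbox)

    def fire(self):
        """Process at most one pending message; return True if work was done."""
        if not self.inbox:
            return False
        result = self.transform(self.inbox.popleft())
        for out in self.outputs:
            out.append(result)
        return True


def run_supervisor(nodes, trace):
    """Fire nodes in a fixed order until no node has pending input."""
    progress = True
    while progress:
        progress = False
        for node in nodes:
            if node.fire():
                trace.append(node.name)
                progress = True


a = PolledNode("A", lambda m: m + 1)
b = PolledNode("B", lambda m: m * 2)
c = PolledNode("C", lambda m: m - 3)
a.connect(b)
b.connect(c)
sink = deque()
c.outputs.append(sink)

a.inbox.extend([1, 2])  # recorded sensor input
trace = []
run_supervisor([a, b, c], trace)
```

Because the supervisor alone decides who runs next, the firing order in `trace` is the same on every run, whatever transport carries the messages.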


One workaround to consider is to “step” rosbag manually. This assumes that the beginning of your pipeline only needs one message to start the chain, so it would not work for, e.g., a stereo vision pipeline where you’re playing back the images from both cameras. You can also write a script to do this based on some condition with the rosbag API. You could also implement your own flow control using services and other mechanisms; however, this is usually a lot of work and leads to less reusable code.


I believe that OpenRTM can do something like I’ve described, but I don’t have any pointers off-hand. I just remember reading about it in the past with respect to their “executors” which control the execution of individual “nodes” (might be a different term in their nomenclature), even across processes.

Also, I believe that LCM allows you to playback data in a way that it will not overflow the queue while a downstream subscriber does not have queue space available. I don’t know if this applies to intermediate topics between two entities, or just for playback of a single topic. Again, no pointers on that off-hand, but I remember reading about it.

Also, related to the ROS 2 work, I think we could use some of the DDS QoS settings to implement a publisher which could block if any downstream subscriber is full, but there are reasons not to do that as well. Imagine that someone adds an introspection tool to a topic as a subscriber: it could slow down the publisher unexpectedly and adversely affect the behavior of the subscribers in the downstream nodes.

@janismac As you said, this is a standard problem that we all face when debugging any real-time, multi-process system, ROS being a clear example of that.

I am planning to work on this problem in the coming months, as part of my strategy of enhancing the debuggability of robotic applications. I will be happy to share the effort with anyone interested in collaborating.

@wjwwood Thank you for sharing the link about ecto, I will take a look at it.

Hi. I am not a very experienced user yet, but I was thinking that Orocos already provides this concept. Doesn’t it?

You can always check if the information that you read is old or new. On the other hand, you can even schedule the components to run in a certain order. The biggest advantage is that it can work in harmony with ROS nodes.

I have the feeling that what @janismac is looking for is this: https://github.com/mozilla/rr

But let’s take a step back in this discussion.

We have two ways to debug our applications, both these approaches have their limitations and advantages.

1. Using classical debugging tools like GDB, which interrupt execution through breakpoints.

The “breakpoints” approach does not work well in multiprocess and real-time applications, as we all know (you can not alter the execution of a distributed system or a control loop without modifying its behavior).

2. Using logging and visualization tools.

I am personally in favor of the “log absolutely everything” approach. This is what I am working on, i.e. good visualization tools (in addition to the awesome Rviz) and VERY low overhead logging libraries. I don’t know how far I can go down the rabbit hole but I will try anyway :wink:

Thanks for all your suggestions!

Ecto: Ecto seems to be always synchronous. I’m aiming for a system that is asynchronous in operation (for better performance) and synchronous when debugging from a log. Also not being able to use the ROS tools is a drawback.

Orocos, OpenRTM, LCM, etc.: If I have to, I could switch frameworks, but I’d rather stick with ROS if possible.

rr sounds pretty good, but as far as I can see it only works on a single process. I’m not sure how one would use that with ROS across multiple nodes and topics.

logging and visualization: Yes those are useful for many problems. But I was thinking about Heisenbug type problems. What if you have a Segfault that cannot be reproduced? It’d be nice to be able to run through the exact same sequence of events in a debugger.

implement your own flow control using services: That’s actually the approach I’ve been working on for the last two days. I’ve developed a concept and prototype to do what I described in my original post. It seems promising so far, it can reproduce an example calculation that is very sensitive to timing changes between the involved nodes. I’ll write up the details tomorrow if anyone is interested.

What if you have a Segfault that cannot be reproduced? It’d be nice to be able to run through the exact same sequence of events in a debugger.

This is EXACTLY the kind of problem that can be perfectly solved using RR.

In a multiprocess system, only a single process will SEGFAULT. If you have recorded it with RR, you will be able to see precisely what happened, since it will replay the state of the process, even if the other nodes are not running.

Don’t dismiss RR too quickly; as I said, I have the feeling that it already solves 90% of your problems :wink:

You can of course use synchronous services instead of asynchronous topics, but you are changing the whole architecture of your system; so it CAN be done, but it is not a decision you should take lightly.

And if you can build your system using a synchronous pipeline, then it means that you abused topics and nodes in the first place :slight_smile:

Davide, do you have experience using RR with ROS?

I haven’t tried it myself, just listened to the podcast about it. Would love to know if it scales well to many-node systems.

Good point! I’ll have another look at it.

I agree. As I said, I want it to be asynchronous during operation but synchronous during analysis/debugging in such a manner that it reproduces the communication pattern / data dependency that occurred during operation.

This is actually something I wished was possible many times, but for a different reason than debugging as in the OP. For us this was mostly about processing pipelines such as might be modeled with e.g. ecto, for which we nevertheless prefer having multiple ROS nodes (for example in order to use the ROS tools to introspect the intermediate results as well, compose the individual parts of the pipelines in different ways, etc.). Now for live operation on a robot that is fine. However, when developing or testing, there is often the desire to run a recorded dataset through the pipeline as fast as possible and in a deterministic way. Doing this with a node graph and rosbag play is difficult. Playing back the bag file too fast might result in overflowing queues, and playing it back too slowly means you don’t crunch the numbers as fast as you could. Also, different optimal rates might apply for different parts of the sequence.

What we have done in the past is write separate wrappers that use the bag API to run through the dataset message by message and invoke the algorithms in a deterministic pipeline (like e.g. ecto does AFAICT). However, you lose composability, and writing such wrappers is very tedious work.

So being able to operate a ROS graph with the kind of polling / pumping mechanism that you mention would address exactly this AFAICT.

This thread is very interesting to me, as it is a recurrent problem developers of distributed systems always struggle with.

In a software system with multiple processes and async communication, determinism is already out the window, since the number of combinations of messages, when received in different orders, explodes very quickly ( unless you pay attention to a huge amount of tiny details with the actual goal of becoming insane ). That is also the reason why you will never remove all bugs by testing a distributed concurrent system multiple times.

A more theoretically sound approach is required, but while computer scientists work on that part, what can be done with our robots ? Here is a shortlist :

  • linearizability : all messages received by a process (node) should be ordered in a queue before doing anything else. This requires extra effort as it is not supported ( or at least the API is not exposed ? ) in ROS. This also allows replaying messages received by a node (from the receiver’s viewpoint, not like rosbag, which does it from the sender’s viewpoint). But that also means that the node relies only on messages received ( not clock, or data in a file somewhere, etc. ); otherwise there are other things to record and replay. Luckily anything ( I think ) can be modeled as a list of messages received at a certain time.

  • one thread per node. One needs to make sure the multithreading in a node introduced by ROS ( services, topics ) does not have any side effect ( nothing modified outside the callback ). This could maybe be enforced with an extension ? Or drop the multithreaded approach and just queue messages to be processed by the node’s single main loop.

  • global synchronous scheduler : it needs to know which message is received by which node (=thread) and at what time, and record this. This makes a deterministic replay possible later, provided that every input to the system is a message. Ideally it should also be the one telling a node to loop another time, so that node state can be saved before and after. This is where you can have a “debug”/“release” distinction, where release would spin as fast as possible ( possibly dropping fairness and determinacy, but these might be addressable by putting a few sleeps here and there, and avoiding accumulators… which is still humanly possible to find and fix )

I think these are the main points…
Actor Model : one thread, one mailbox. And to compose them you need a scheduler that ensures fairness of computation, records everything and can replay all messages. A message is the only way to interact between threads.
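
As a toy illustration of the points above (plain Python, nothing ROS-specific): each actor changes state only inside its message handler, and a scheduler records the global delivery order so that it can be replayed later. `Actor`, `RecordingScheduler` and `replay` are names invented for this sketch, and the state update is deliberately order-sensitive.

```python
class Actor:
    """One mailbox-driven unit: state changes only inside handle()."""

    def __init__(self):
        self.state = 0

    def handle(self, msg):
        # Order-sensitive on purpose: swapping two messages changes the result.
        self.state = self.state * 31 + msg


class RecordingScheduler:
    """Delivers messages one at a time and logs the global delivery order."""

    def __init__(self, actors):
        self.actors = actors
        self.log = []  # entries: (sequence number, actor name, message)

    def deliver(self, name, msg):
        self.log.append((len(self.log), name, msg))
        self.actors[name].handle(msg)


def replay(log, actors):
    """Re-deliver logged messages to fresh actors in the recorded order."""
    for _, name, msg in sorted(log, key=lambda entry: entry[0]):
        actors[name].handle(msg)


live = {"A": Actor(), "B": Actor()}
scheduler = RecordingScheduler(live)
scheduler.deliver("A", 1)
scheduler.deliver("B", 5)
scheduler.deliver("A", 2)

fresh = {"A": Actor(), "B": Actor()}
replay(scheduler.log, fresh)

# Delivering A's messages in the other order gives a different state.
reordered = Actor()
reordered.handle(2)
reordered.handle(1)
```

The replay only matches because messages are the sole input; an actor that also read a clock or a file would need those reads logged as messages too.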

Note 1 : one can have a scheduler inside a node between threads, and outside between nodes. But this hierarchy can make things more complex than needed.

Note 2 : The OS scheduler is non deterministic ( in short because it is influenced by real world state that you cannot possibly control, and because it was designed for processes that compete for machine resources ) so you need to build and use your own in your application when you want cooperative process scheduling.

I hope this is helpful. My impression is that the tools listed each cover only part of this list, and ROS as a framework lacks the necessary constraints to enforce determinism. Other frameworks might do that, but then it’s not ROS anymore :slight_smile:

I would :heart:love:heart: to see libraries developed that could ensure determinacy in a ROS system.

We actually ran into the same problem as @janismac of needing deterministic data processing when receiving data from rosbags. @asmodehn nicely summarized the points that are important when implementing a node with deterministic behaviour.

Considering the ROS communication part, we implemented a very simple handshake mechanism in rosbag playback to prevent message loss: We modified rosbag to keep track of the filling state of each registered node. Whenever rosbag sends out a message the queue filling counter for the corresponding node is decreased. When the corresponding subscriber actually processes a received message it signals a free slot in the input queue by a service call to rosbag. If one of the monitored queues is full, rosbag automatically pauses playback until the corresponding node signals a free queue slot again. The mechanism is only active if use_sim_time is set to true so it does not influence message processing when using “real” sensor data.

The system works fine for linear data flows, meaning that one message (block) sent from rosbag to node A maps to exactly one message (block) which is passed from node A to node B and so on. A message block means a group of messages which belong together (for example a camera image and the corresponding camera_info message). With the mechanism you can set the playback speed to, let’s say, -r 100 and the processing will run at the speed of the slowest node in the processing chain.
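
The handshake can be pictured with a small stand-alone simulation. This is not the rc_bagthrottler code and uses no rosbag API: `ThrottledPlayer`, `slot_freed` and `play` are hypothetical names, the counter is modelled here as a count of free slots (one possible reading of the description above), and the service call is replaced by a direct method call.

```python
from collections import deque


class ThrottledPlayer:
    """Hypothetical stand-in for a bag player with a per-subscriber handshake."""

    def __init__(self, messages, queue_size):
        self.pending = deque(messages)
        self.free_slots = queue_size  # slots believed free in the subscriber queue

    def try_send(self, subscriber_queue):
        """Send the next message only if the subscriber has a free slot."""
        if not self.pending or self.free_slots == 0:
            return False  # playback pauses
        subscriber_queue.append(self.pending.popleft())
        self.free_slots -= 1
        return True

    def slot_freed(self):
        """Stands in for the service call a subscriber makes after processing."""
        self.free_slots += 1


def play(messages, queue_size, consume_every):
    """Run a fast player against a subscriber that consumes every N-th tick."""
    player = ThrottledPlayer(messages, queue_size)
    queue, processed, max_fill = deque(), [], 0
    tick = 0
    while player.pending or queue:
        player.try_send(queue)
        max_fill = max(max_fill, len(queue))
        if tick % consume_every == consume_every - 1 and queue:
            processed.append(queue.popleft())
            player.slot_freed()
        tick += 1
    return processed, max_fill


processed, max_fill = play(list(range(20)), queue_size=2, consume_every=5)
```

Even though the player tries to send on every tick, the queue never holds more than `queue_size` messages and the slow subscriber still processes every message in order.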

We uploaded the modified rosbag and the rc_bagthrottler module to github:
https://github.com/roboception

At the current state of development, the documentation is a little bit rudimentary and there is still some “zombie” code. We are going to extend it and clean up the code very soon. Until now, it was not necessary for us to extend the mechanism for a further handshake directly between nodes but we would be happy about discussions on the subject.

@Korbinian_Schmid thx a lot for above code.

@facontidavide @janismac @asmodehn have you guys by chance been able to make any progress (that is, release any code) wrt deterministic replay of data as discussed above?

@Dejan_Pangercic
Sadly I couldn’t work much on this recently, just on and off, in my free time.
I can advise a first step, which is to use https://github.com/ros-testing/hypothesis-ros to “blackbox” property-test your nodes.
Then, it is about practicing zen-like discipline when designing and implementing your ROS nodes…
I have a few design ideas in mind (functional-style event logging) that I am currently playing around with in (python) code, but nothing usable just yet.

Has there been any development related to this topic in the past year? Is there maybe something in ROS 2 that would help with deterministic replay as described in this thread?

We faced the same problem in our pub / sub framework, which has the same conflict between loosely coupled messaging and deterministic replay and debugging. The strategy that we currently use is to link the software component core (a C++ library) against different kinds of component frames. One of these frames is a simulation binding that is linked directly against the measurement C++ API. That way we inject the messages into the component interface in a deterministic manner, in a single process.

Another approach to deterministic stimulation is to build a Python extension from the component library and use a Python script to load the measurement and stimulate the component core from a Python dictionary (which contains all messages and timestamps). In this case debugging is trickier but possible.
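
Roughly what this could look like, as a sketch only: the real component core would be the compiled library, whereas here a small Python class stands in for it. `ComponentCore`, `on_message`, `stimulate`, the topic names and the dictionary layout (topic mapped to a list of `(timestamp, data)` pairs) are all assumptions made for this example.

```python
class ComponentCore:
    """Hypothetical component core with no middleware dependency."""

    def __init__(self):
        self.last_scan = None
        self.outputs = []

    def on_message(self, topic, stamp, data):
        if topic == "scan":
            self.last_scan = data
        elif topic == "odom" and self.last_scan is not None:
            self.outputs.append((stamp, self.last_scan + data))


def stimulate(core, measurement):
    """Feed every recorded message to the core in (timestamp, topic) order."""
    events = [
        (stamp, topic, data)
        for topic, samples in measurement.items()
        for stamp, data in samples
    ]
    # Sorting by timestamp, then topic name, fixes the order of simultaneous messages.
    for stamp, topic, data in sorted(events, key=lambda e: (e[0], e[1])):
        core.on_message(topic, stamp, data)
    return core.outputs


measurement = {
    "scan": [(0.0, 10), (0.2, 20)],
    "odom": [(0.1, 1), (0.3, 2)],
}
outputs = stimulate(ComponentCore(), measurement)
```

Since everything runs in one process and one thread, the same dictionary always produces the same outputs, and a debugger breakpoint inside the core stalls nothing else.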

Wouldn’t something like this work for ROS too ?

You are right! I think it also needs to be solved for offline simulation.

Hi everyone,

I have joined the forum with this very same problem. Are there any new methods to solve it?

And forgive me if this is a newbie question: is there a way to run a ROS node with data from a rosbag in “isolation”, i.e. standalone?

Thanks,

@rreddy78 I’m not aware of an open-source solution.

For those who need to take their ROS 2-based systems to production, or you just want a more robust ROS 2, you can look to us at Apex.AI for features like this. We have a great solution for enabling deterministic replay of data within Apex.OS (our fork of ROS 2).

Please note that this is a proprietary feature and it will remain that way. I hate to dangle the carrot in front of you, but that’s the way it goes :upside_down_face:

Thank you for this information. It certainly seems a good solution once we want to move from prototyping.