Add Heartbeat message type

Hi, I’ve been missing a Heartbeat type of message in ROS for quite a long time (talking about ROS 1, but this probably also applies to ROS 2). I’m about to open an issue/PR in the ROS repos asking to add it, but before that I wanted to collect some ideas here.

Why: If you have some system diagnostics that checks that your sensors are running, you basically have two options now, neither of which is both easy and good:

  1. Use TopicDiagnostics on the publisher side and publish to /diagnostics whether the expected rate/delay is okay.
  2. Subscribe to the output topic of the sensor and check the delays/rates on the subscriber end.

I think 1. is almost good, except that it usually takes a lot of lines of code to set up diagnostics for a node. Also, finding and checking the rate of the node in postprocessing requires going through all diagnostics messages and parsing them to find the relevant reports. Option 2. is not good at all for large sensor data, where you create unnecessary CPU and network load just to check the frequency.
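
For illustration, here is roughly what option 1 takes in ROS 1 with diagnostic_updater (just a sketch; the topic name, expected rate and tolerances are made-up example values):

#include <ros/ros.h>
#include <sensor_msgs/PointCloud2.h>
#include <diagnostic_updater/diagnostic_updater.h>
#include <diagnostic_updater/update_functions.h>
#include <diagnostic_updater/publisher.h>

// Sketch of option 1: per-topic rate/delay diagnostics published to /diagnostics.
int main(int argc, char** argv)
{
  ros::init(argc, argv, "lidar_driver");
  ros::NodeHandle nh;

  diagnostic_updater::Updater updater;
  updater.setHardwareID("ouster_lidar");  // made-up hardware ID

  double expected_rate = 10.0;  // Hz, hypothetical
  diagnostic_updater::TopicDiagnostic diag(
      "points", updater,
      diagnostic_updater::FrequencyStatusParam(&expected_rate, &expected_rate, 0.1, 10),
      diagnostic_updater::TimeStampStatusParam(0.0, 0.1));

  ros::Publisher pub = nh.advertise<sensor_msgs::PointCloud2>("points", 10);

  ros::Rate rate(expected_rate);
  while (ros::ok())
  {
    sensor_msgs::PointCloud2 msg;  // filled with sensor data in a real driver
    msg.header.stamp = ros::Time::now();
    pub.publish(msg);
    diag.tick(msg.header.stamp);  // feed the diagnostic with every publish
    updater.update();             // aggregates and publishes to /diagnostics ~1x per second
    rate.sleep();
  }
  return 0;
}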

The idea of the Heartbeat message is that it has the following definition:

Header header

That’s it. A code path that publishes the sensor data would just copy the header of the message to the Heartbeat message and publish the heartbeat right after publishing the sensor message (or before it?).
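
In roscpp, that code path could look roughly like this (a sketch; my_msgs/Heartbeat is a hypothetical name for a package generating the proposed Header-only message, nothing like it exists upstream yet):

#include <ros/ros.h>
#include <sensor_msgs/PointCloud2.h>
#include <my_msgs/Heartbeat.h>  // hypothetical package for the proposed Header-only message

// Publish the data message and mirror its header into the heartbeat.
void publishWithHeartbeat(ros::Publisher& data_pub, ros::Publisher& heartbeat_pub,
                          const sensor_msgs::PointCloud2& cloud)
{
  data_pub.publish(cloud);

  my_msgs::Heartbeat beat;
  beat.header = cloud.header;  // same stamp and frame_id as the data message
  heartbeat_pub.publish(beat);
}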

The postprocessing step is then made super-easy if you record the heartbeats. Also, writing diagnostic analyzers for this would be pretty easy, and there could even be a generic one usable for all heartbeats, greatly simplifying the diagnostics setup. And there is of course no unnecessary CPU or network load.

Taking the idea further, NodeHandle could provide a method like advertiseWithHeartbeat() that would automatically set up and control the heartbeat publisher in addition to the normal sensor data publisher. The publish() call on the sensor publisher would automatically trigger the heartbeat publish, too. The heartbeat publisher could be automatically created with a standard topic suffix, so e.g. a sensor publisher on topic /scan would also create a heartbeat publisher on topic /scan/heartbeat. This naming convention could be further utilized in e.g. rqt_graph or similar analytics tools. But that’s for the future; for now I’d like to concentrate on adding the Heartbeat message itself.
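
Until then, a thin wrapper could approximate the advertiseWithHeartbeat() idea (again just a sketch, reusing the hypothetical my_msgs/Heartbeat from above):

#include <string>
#include <ros/ros.h>
#include <my_msgs/Heartbeat.h>  // hypothetical Header-only message, as above

// Advertises <topic> and <topic>/heartbeat and publishes both together.
template <class M>
class HeartbeatPublisher
{
public:
  HeartbeatPublisher(ros::NodeHandle& nh, const std::string& topic, uint32_t queue_size)
    : data_pub_(nh.advertise<M>(topic, queue_size))
    , heartbeat_pub_(nh.advertise<my_msgs::Heartbeat>(topic + "/heartbeat", queue_size))
  {
  }

  void publish(const M& msg)
  {
    data_pub_.publish(msg);
    my_msgs::Heartbeat beat;
    beat.header = msg.header;  // assumes M has a Header field
    heartbeat_pub_.publish(beat);
  }

private:
  ros::Publisher data_pub_;
  ros::Publisher heartbeat_pub_;
};

// Hypothetical usage: HeartbeatPublisher<sensor_msgs::LaserScan> pub(nh, "scan", 10);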

We have already implemented a custom Heartbeat message in our Ouster lidar driver and used it during the SubT Challenge to verify the sensor was working as expected. It worked quite well. We can now also create “stripped” bag files which do not contain the large sensor data (so they are 1 GB instead of 40 GB), while still having an idea of whether the sensor worked as expected at a given time point (thanks to the heartbeat messages).

One question remains: where should this message go? diagnostic_msgs would probably be a good place. But maybe std_msgs or e.g. sensor_msgs?

What are your opinions on this Heartbeat message?


Just to mention it: wiki/bond has a couple of messages it uses for this purpose, such as Status:

Header header
string id  # ID of the bond
string instance_id  # Unique ID for an individual in a bond
bool active

# Including the timeouts for the bond makes it easier to debug mis-matches
# between the two sides.
float32 heartbeat_timeout
float32 heartbeat_period

Note that in general you cannot really conclude anything if/when you “still receive heartbeat messages”, other than that the process is probably alive and at least your heartbeat publisher (and subscriber) are still working. Whether or not the node is still performing useful work would not be observable through this.

In ROS 2, you’d probably do this with the support for Deadline, Liveliness, and Lifespan (docs). At least when using DDS.
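
A minimal rclcpp sketch of what a deadline-monitored subscription could look like (the 200 ms deadline and the points topic are arbitrary examples; the event callback is registered through rclcpp::SubscriptionOptions):

#include <rclcpp/rclcpp.hpp>
#include <sensor_msgs/msg/point_cloud2.hpp>

int main(int argc, char** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("deadline_monitor");
  auto logger = node->get_logger();

  // Require the publisher to deliver at least one sample every 200 ms.
  rclcpp::QoS qos(10);
  qos.deadline(rclcpp::Duration::from_seconds(0.2));

  // Get notified whenever the deadline is missed on the subscription side.
  rclcpp::SubscriptionOptions options;
  options.event_callbacks.deadline_callback =
      [logger](rclcpp::QOSDeadlineRequestedInfo& event) {
        RCLCPP_WARN(logger, "Deadline missed (total: %d)", event.total_count);
      };

  auto sub = node->create_subscription<sensor_msgs::msg::PointCloud2>(
      "points", qos,
      [](sensor_msgs::msg::PointCloud2::ConstSharedPtr) { /* normal data handling */ },
      options);

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}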


Thanks for your ideas, @gavanderhoorn. I think bond is quite overkill for the task at hand.

And yes, I know that the frequency of the heartbeat messages does not say much - there can still be sensor blockage, wrong processing of data, etc. However, as the heartbeat should be published right before/after publishing the sensor message, you can at least figure out whether the sensor readout callback is called at the expected rate. For example with Ouster, UDP data can stop coming from the sensor either because of a network failure or because the sensor is overheated… And then the callbacks stop being called. The heartbeat is just a very basic diagnostic helping with one part of monitoring. If you have a critical sensor, you should of course not be satisfied with heartbeats. But in a lot of cases, they could be enough.

I’m curious why you went with Header instead of a minimal message, e.g. just a sequence number or a timestamp?

I think a heartbeat message is definitely useful, especially in areas where you cannot guarantee a stable and constant connection to the robot.

Because Header is “special” in ROS. Messages with a Header field have special support in a lot of tools like PlotJuggler, TopicDiagnostics etc. There is also the C++ type trait HasHeader<M> which tools use.
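
For example, the trait can be queried at compile time (a small ROS 1 sketch using ros::message_traits):

#include <ros/message_traits.h>
#include <sensor_msgs/PointCloud2.h>
#include <std_msgs/String.h>

// Compile-time check whether a message type carries a std_msgs/Header.
static_assert(ros::message_traits::HasHeader<sensor_msgs::PointCloud2>::value,
              "PointCloud2 has a header");
static_assert(!ros::message_traits::HasHeader<std_msgs::String>::value,
              "String has no header");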

In ROS 2, you’d probably do this with the support for Deadline, Liveliness, and Lifespan (docs). At least when using DDS.

Just to add to that, the ROS Safety WG has coded up a Heartbeat/Watchdog that relies on these DDS features. The heartbeat message can also easily be added to a node via composition (e.g., in a launch file).


I never fully understood the purpose, but bond is already somehow built into ROS 1 diagnostics (it was not ported to ROS 2 diagnostics), see:

Maybe you can adapt this to your use case?

You could also have a look at the watchdogs provided by the ROS safety working group, although they are currently only implemented for ROS 2 (partially based on DDS features):

That’s not an integration, it’s plain usage of bond. The analyzer loader creates a bond to the aggregator so that when the loader node(let) is shut down, its analyzer is unloaded from the aggregator.

Thanks, it’s nice how you both wrote about the same thing at the same time =) So it seems ROS 2 already goes in this direction. If I understand it correctly, the linked SW watchdog is only a node-level watchdog and not a publisher-level watchdog, am I right?


Hi guys,

I’m not an expert in the rclcpp API, but I’m pretty sure you can access the discovery graph directly, so you can use that to check whether a specific publication/subscription is present. As far as I know, it is possible to get a notification whenever the graph changes. ROS 2 CLI tools like ros2 topic list rely on this mechanism (through the ROS 2 daemon). Furthermore, if an entity dies unexpectedly, DDS has a built-in liveliness mechanism, so all the pieces should already be there IMO.

The publisher doesn’t need to die to stop publishing. It’s enough for it to receive no data to pass on…

I’m mainly interested in the ROS 1 part, but the ROS 2 part is also pretty interesting. As I haven’t used ROS 2 yet, I might have some beginner questions. The first one being: are the liveliness/deadline data from DDS recorded in bag files? I.e., is it possible to inspect from the recordings why the system failed and find out “oh, this topic stopped being published”? Or is it only possible via some kind of watchdog that records its findings, such as the one linked above?

Are you aware of the Heartbeat diagnostics task that exists within the diagnostics framework?

This works with all the standard ROS diagnostics tooling, e.g. aggregator and the rqt tooling.

It is only a couple of lines of code (a rough sketch follows the list):

  • Setup the updater
  • Add the task
  • Call updater.update periodically
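
Something like this (a sketch of those three steps using the existing diagnostic_updater API; the node name and hardware ID are made up, and Heartbeat here is the existing diagnostics task, not the proposed message):

#include <ros/ros.h>
#include <diagnostic_updater/diagnostic_updater.h>
#include <diagnostic_updater/update_functions.h>

int main(int argc, char** argv)
{
  ros::init(argc, argv, "my_node");
  ros::NodeHandle nh;

  diagnostic_updater::Updater updater;     // 1. set up the updater
  updater.setHardwareID("my_node");

  diagnostic_updater::Heartbeat heartbeat; // 2. add the task
  updater.add(heartbeat);

  // 3. call updater.update() periodically (it publishes to /diagnostics)
  ros::Timer timer = nh.createTimer(ros::Duration(0.1),
      [&updater](const ros::TimerEvent&) { updater.update(); });

  ros::spin();
  return 0;
}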

Yes. It does a node-level heartbeat. I am mainly interested in a publisher-level heartbeat. Of course, I could create the heartbeat task and call update() when publishing, but the Heartbeat task isn’t supposed to watch any frequency other than 1 Hz anyway (at least not as currently implemented in the downstream tools).

Not always. You should switch to a multithreaded or async spinner to make sure a choking callback will not stop diagnostics publishing. Then it starts getting complicated.
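
In roscpp that roughly means something like this (a sketch; the thread count is an arbitrary choice):

#include <ros/ros.h>

// With an AsyncSpinner, callbacks run in a small thread pool, so a single
// blocking callback does not stall the rest (e.g. a diagnostics timer).
int main(int argc, char** argv)
{
  ros::init(argc, argv, "my_node");
  ros::NodeHandle nh;  // subscribers and the diagnostics timer are set up elsewhere

  ros::AsyncSpinner spinner(4);  // 4 worker threads
  spinner.start();
  ros::waitForShutdown();
  return 0;
}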

And to say it again - having the heartbeat published on /diagnostics topic means you need to do string-based parsing/matching to analyze the performance in postprocessing. I don’t consider that a very good option, at least not for things I want to check regularly.

Can you expand on how the “heartbeat” message verifies the sensor is working as expected?

To be concrete, there are multiple levels of perception failures from a LIDAR. We could break this down into sensor interface, sensor data, and perception.

At the sensor interface level, there are errors in communication of the data from the LIDAR to the compute platform running ROS. Dropped ethernet packets. Delays in arrival of the data. Basic errors in the data itself.

For sensor data, is the sensor producing data in spec? Did a HW component fail or report any errors? Are the LIDAR points within spec / valid?

At the perception level, is the sensor blinded? Is a soda can or plastic bag covering the LIDAR? Is the LIDAR casing covered in snow, or grease?

Is the heartbeat a simple boolean capturing all of this diagnostic information for a higher-level, system-wide health monitor? Or is this a simple “I’m still here” message, and the above diagnostic work would still need to be performed?

Thanks.


The heartbeat messages in a lidar driver node, as I see it, can tell the following:

  • Heartbeat is coming
    • Heartbeat is coming at desired frequency and with reasonable delay
      • → Lidar is running, network communication works and the driver is processing the data coming from the lidar
    • Heartbeat is slower than expected or has an unusual delay
      • → There might be performance problems either on the computer or in the network
  • Heartbeat is not coming
    • → Either the driver is not running, there is a network failure, or the sensor is dead

I think heartbeat cannot and should not be trying to capture any kind of semantic functionality of the sensor. That is left for more advanced analyzers.

As you can see from my little “diagram”, the information the heartbeat gives is usually ambiguous. But practically, I know I want to start a robot and see a green checkmark somewhere telling me the lidar is “basically working” - on, connected, communicating, driver running. Even this simple heartbeat check can confirm that a lot of configuration and communication has been done right.

@peci1 thanks for sharing your thoughts :+1:

it requires going through all diagnostics messages and parsing them to find the relevant reports.

Off topic, but I think this can be improved by Content Filtered Topic support, which is a work in progress.

For example with Ouster, UDP data can stop coming from the sensor either because of a network failure or because the sensor is overheated… And then the callbacks stop being called.

I am not really sure why publishers are related to this. I think that whatever happens to the publisher, the RMW implementation will notify the subscription if this requirement is missed.

Heartbeat is coming

Deadline event will not be notified.

Heartbeat is not coming

Deadline event will be notified.

Could you elaborate on what requirement is missing in your use case? I am interested in that.

thanks


@tomoyafujita Thanks for your reply, it brings a lot of insight into the workings of ROS 2. My main objective here, though, is ROS 1. It does not offer any of these RMW goodies.

But staying with your ROS 2 notes, it looks like during runtime, a lot is available to monitor e.g. a driver. I have two questions left:

  1. Can any RMW tool/API be used to only get the rate of publication without actually subscribing? This can be a big deal with RealSense-like sensors which can generate almost a gigabyte per second of pointclouds. As much as possible, I only want to have a single subscriber for this amount of data - some perception node. But I also want to run a separate simple diagnostics node that just checks the data rate (without bloating the perception node too much).
  2. Can any of the deadline events be recorded? Imagine a robot that failed during a mission and you have to investigate the reason. Will you see somewhere that the deadline event was notified? Or do you have to explicitly generate a message on receipt of such an event and store it in the bag yourself?

Just to explain this a little bit more - the update() function will get called many times a second, but the diagnostics publisher will only aggregate whatever happened once a second and publish this aggregate information. It is not possible to figure out from the /diagnostics topic which exact sensor message went missing.

@peci1 thanks for the information!

Can any RMW tool/API be used to only get the rate of publication without actually subscribing?

No, I do not think so.

Can any of the deadline events be recorded?

No, I don’t think so either. I think it is the application’s responsibility to handle it after the event is notified.

I think these statistics, together with the endpoint graph, would be useful for monitoring the distributed system to see if the whole system works okay. Just curious, does anyone know whether DDS supports any statistics API?

Fast DDS provides a Statistics module for exactly this purpose. There is also a GUI application Fast DDS Monitor that can be used to monitor the performance of the DDS network in real time. You may be interested in the tutorial to use it alongside ROS 2 Galactic.

You could use the eCAL RMW and then monitor your system (including frequencies and data sizes) using the eCAL Monitor standard tool.