Add Heartbeat message type

Hi, I’ve been missing a Heartbeat type of message in ROS for quite a long time (talking about ROS 1, but this probably also applies to ROS 2). I’m about to open an issue/PR in the ROS repos asking to add it, but before that I wanted to collect some ideas here.

Why: If you have some system diagnostics that checks that your sensors are running, you basically have two options now, neither of which is both easy and good:

  1. Use TopicDiagnostics on the publisher side and publish to /diagnostics whether the expected rate/delay is okay.
  2. Subscribe to the output topic of the sensor and check the delays/rates on the subscriber end.

I think 1. is almost good, except that it usually takes a lot of lines of code to set up diagnostics for a node. Also, finding and checking the rate of the node in postprocessing requires going through all diagnostics messages and parsing them to find the relevant reports. Option 2. is not good at all for large sensor data, where you create unnecessary CPU and network load just to check the frequency.
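
For illustration, here is roughly what option 1 takes in ROS 1 with diagnostic_updater (just a sketch; the topic name, expected rate and tolerances are made-up example values):

#include <ros/ros.h>
#include <sensor_msgs/PointCloud2.h>
#include <diagnostic_updater/diagnostic_updater.h>
#include <diagnostic_updater/update_functions.h>
#include <diagnostic_updater/publisher.h>

// Sketch of option 1: per-topic rate/delay diagnostics published to /diagnostics.
int main(int argc, char** argv)
{
  ros::init(argc, argv, "lidar_driver");
  ros::NodeHandle nh;

  diagnostic_updater::Updater updater;
  updater.setHardwareID("ouster_lidar");  // made-up hardware ID

  double expected_rate = 10.0;  // Hz, hypothetical
  diagnostic_updater::TopicDiagnostic diag(
      "points", updater,
      diagnostic_updater::FrequencyStatusParam(&expected_rate, &expected_rate, 0.1, 10),
      diagnostic_updater::TimeStampStatusParam(0.0, 0.1));

  ros::Publisher pub = nh.advertise<sensor_msgs::PointCloud2>("points", 10);

  ros::Rate rate(expected_rate);
  while (ros::ok())
  {
    sensor_msgs::PointCloud2 msg;  // filled with sensor data in a real driver
    msg.header.stamp = ros::Time::now();
    pub.publish(msg);
    diag.tick(msg.header.stamp);  // feed the diagnostic with every publish
    updater.update();             // aggregates and publishes to /diagnostics ~1x per second
    rate.sleep();
  }
  return 0;
}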

The idea of the Heartbeat message is that it has the following definition:

Header header

That’s it. A code path that publishes the sensor data would just copy the header of the message to the Heartbeat message and publish the heartbeat right after publishing the sensor message (or before it?).
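
In roscpp, that code path could look roughly like this (a sketch; my_msgs/Heartbeat is a hypothetical name for a package generating the proposed Header-only message, nothing like it exists upstream yet):

#include <ros/ros.h>
#include <sensor_msgs/PointCloud2.h>
#include <my_msgs/Heartbeat.h>  // hypothetical package for the proposed Header-only message

// Publish the data message and mirror its header into the heartbeat.
void publishWithHeartbeat(ros::Publisher& data_pub, ros::Publisher& heartbeat_pub,
                          const sensor_msgs::PointCloud2& cloud)
{
  data_pub.publish(cloud);

  my_msgs::Heartbeat beat;
  beat.header = cloud.header;  // same stamp and frame_id as the data message
  heartbeat_pub.publish(beat);
}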

The postprocessing step is then made super-easy if you record the heartbeats. Also, writing diagnostic analyzers for this would be pretty easy, and there could even be a generic one usable for all heartbeats, greatly simplifying the diagnostics setup. And there is of course no unnecessary CPU or network load.

Taking the idea further, NodeHandle could provide a method like advertiseWithHeartbeat() that would automatically set up and control the heartbeat publisher in addition to the normal sensor data publisher. The publish() call on the sensor publisher would automatically trigger the heartbeat publish, too. The heartbeat publisher could be automatically created with a standard topic suffix, so e.g. a sensor publisher on topic /scan would also create a heartbeat publisher on topic /scan/heartbeat. This naming convention could be further utilized in e.g. rqt_graph or similar analytics tools. But that’s for the future; for now I’d like to concentrate on adding the Heartbeat message itself.
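
Until then, a thin wrapper could approximate the advertiseWithHeartbeat() idea (again just a sketch, reusing the hypothetical my_msgs/Heartbeat from above):

#include <string>
#include <ros/ros.h>
#include <my_msgs/Heartbeat.h>  // hypothetical Header-only message, as above

// Advertises <topic> and <topic>/heartbeat and publishes both together.
template <class M>
class HeartbeatPublisher
{
public:
  HeartbeatPublisher(ros::NodeHandle& nh, const std::string& topic, uint32_t queue_size)
    : data_pub_(nh.advertise<M>(topic, queue_size))
    , heartbeat_pub_(nh.advertise<my_msgs::Heartbeat>(topic + "/heartbeat", queue_size))
  {
  }

  void publish(const M& msg)
  {
    data_pub_.publish(msg);
    my_msgs::Heartbeat beat;
    beat.header = msg.header;  // assumes M has a Header field
    heartbeat_pub_.publish(beat);
  }

private:
  ros::Publisher data_pub_;
  ros::Publisher heartbeat_pub_;
};

// Hypothetical usage: HeartbeatPublisher<sensor_msgs::LaserScan> pub(nh, "scan", 10);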

We have already implemented a custom Heartbeat message in our Ouster lidar driver and used it during the SubT Challenge to verify the sensor was working as expected. It worked quite well. We can now also create “stripped” bag files which do not contain the large sensor data (so they are 1 GB instead of 40 GB), while still having an idea of whether the sensor worked as expected at a given time point (thanks to the heartbeat messages).

One question remains: where should this message go? diagnostic_msgs would probably be a good place. But maybe std_msgs or e.g. sensor_msgs?

What are your opinions on this Heartbeat message?


Just to mention it: wiki/bond has a couple of messages it uses for this purpose, such as Status:

Header header
string id  # ID of the bond
string instance_id  # Unique ID for an individual in a bond
bool active

# Including the timeouts for the bond makes it easier to debug mis-matches
# between the two sides.
float32 heartbeat_timeout
float32 heartbeat_period

Note that in general you cannot really conclude anything if/when you “still receive heartbeat messages”, other than that the process is probably alive and at least your heartbeat publisher (and subscriber) are still working. Whether or not the node is still performing useful work would not be observable through this.

In ROS 2, you’d probably do this with the support for Deadline, Liveliness, and Lifespan (docs). At least when using DDS.
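
A minimal rclcpp sketch of what a deadline-monitored subscription could look like (the 200 ms deadline and the points topic are arbitrary examples; the event callback is registered through rclcpp::SubscriptionOptions):

#include <rclcpp/rclcpp.hpp>
#include <sensor_msgs/msg/point_cloud2.hpp>

int main(int argc, char** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("deadline_monitor");
  auto logger = node->get_logger();

  // Require the publisher to deliver at least one sample every 200 ms.
  rclcpp::QoS qos(10);
  qos.deadline(rclcpp::Duration::from_seconds(0.2));

  // Get notified whenever the deadline is missed on the subscription side.
  rclcpp::SubscriptionOptions options;
  options.event_callbacks.deadline_callback =
      [logger](rclcpp::QOSDeadlineRequestedInfo& event) {
        RCLCPP_WARN(logger, "Deadline missed (total: %d)", event.total_count);
      };

  auto sub = node->create_subscription<sensor_msgs::msg::PointCloud2>(
      "points", qos,
      [](sensor_msgs::msg::PointCloud2::ConstSharedPtr) { /* normal data handling */ },
      options);

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}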


Thanks for your ideas, @gavanderhoorn. I think bond is quite overkill for the task at hand.

And yes, I know that the frequency of the heartbeat messages does not say much - there can still be sensor blockage, wrong processing of data, etc. However, as the heartbeat should be published right before/after publishing the sensor message, you can at least figure out whether the sensor readout callback is called at the expected rate. For example with Ouster, UDP data can stop coming from the sensor either because of a network failure or because the sensor is overheated… And then the callbacks stop being called. The heartbeat is just a very basic diagnostic helping with one part of monitoring. If you have a critical sensor, you should of course not be satisfied with heartbeats. But in a lot of cases, they could be enough.

I’m curious why you went with Header instead of a minimal message, e.g. just a sequence number or a timestamp?

I think a heartbeat message is definitely useful, especially in areas where you cannot guarantee a stable and constant connection to the robot.

Because Header is “special” in ROS. Messages with a Header field have special support in a lot of tools like PlotJuggler, TopicDiagnostics etc. There is also the C++ type trait HasHeader<M> which tools use.
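
For example, the trait can be queried at compile time (a small ROS 1 sketch using ros::message_traits):

#include <ros/message_traits.h>
#include <sensor_msgs/PointCloud2.h>
#include <std_msgs/String.h>

// Compile-time check whether a message type carries a std_msgs/Header.
static_assert(ros::message_traits::HasHeader<sensor_msgs::PointCloud2>::value,
              "PointCloud2 has a header");
static_assert(!ros::message_traits::HasHeader<std_msgs::String>::value,
              "String has no header");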

In ROS 2, you’d probably do this with the support for Deadline, Liveliness, and Lifespan (docs). At least when using DDS.

Just to add to that, the ROS Safety WG has coded up a Heartbeat/Watchdog that relies on these DDS features. The heartbeat message can also easily be added to a node via composition (e.g., in a launch file).


I never fully understood the purpose, but bond is already somehow built into ROS 1 diagnostics (it was not ported to ROS 2 diagnostics), see:

Maybe you can adapt this to your use case?

You could also have a look at the watchdogs provided by the ROS safety working group, although they are currently only implemented for ROS 2 (partially based on DDS features):

That’s not an integration, it’s plain usage of bond. The analyzer loader creates a bond to the aggregator so that when the loader node(let) is shut down, its analyzer is unloaded from the aggregator.

Thanks, it’s nice how you both wrote about the same thing at the same time =) So it seems ROS 2 already goes in this direction. If I understand it correctly, the linked SW watchdog is only a node-level watchdog and not a publisher-level watchdog, am I right?


Hi guys,

I’m not an expert in the rclcpp API, but I’m pretty sure you can access the discovery graph directly, so you can use that to check whether a specific publication/subscription is present. As far as I know, it is possible to get a notification whenever the graph changes. ROS 2 CLI tools like ros2 topic list rely on this mechanism (through the ROS 2 daemon). Furthermore, if an entity dies unexpectedly, DDS has a built-in liveliness mechanism, so all the pieces should already be there IMO.

The publisher doesn’t need to die to stop publishing. It’s enough for it to receive no data to pass on…

I’m mainly interested in the ROS 1 part, but the ROS 2 part is also pretty interesting. As I haven’t used ROS 2 yet, I might have some beginner questions. The first one being: are the liveliness/deadline data from DDS recorded in bag files? I.e., is it possible to inspect from the recordings why the system failed and find out “oh, this topic stopped being published”? Or is it only possible via some kind of watchdog that records its findings, such as the one linked above?

Are you aware of the Heartbeat diagnostics task that exists within the diagnostics framework?

This works with all the standard ROS diagnostics tooling, e.g. aggregator and the rqt tooling.

It is only a couple of lines of code (a rough sketch follows the list):

  • Setup the updater
  • Add the task
  • Call updater.update periodically
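
Something like this (a sketch of those three steps using the existing diagnostic_updater API; the node name and hardware ID are made up, and Heartbeat here is the existing diagnostics task, not the proposed message):

#include <ros/ros.h>
#include <diagnostic_updater/diagnostic_updater.h>
#include <diagnostic_updater/update_functions.h>

int main(int argc, char** argv)
{
  ros::init(argc, argv, "my_node");
  ros::NodeHandle nh;

  diagnostic_updater::Updater updater;     // 1. set up the updater
  updater.setHardwareID("my_node");

  diagnostic_updater::Heartbeat heartbeat; // 2. add the task
  updater.add(heartbeat);

  // 3. call updater.update() periodically (it publishes to /diagnostics)
  ros::Timer timer = nh.createTimer(ros::Duration(0.1),
      [&updater](const ros::TimerEvent&) { updater.update(); });

  ros::spin();
  return 0;
}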

Yes. It does a node-level heartbeat. I am mainly interested in a publisher-level heartbeat. Of course, I could create the heartbeat task and call update() when publishing, but the Heartbeat task isn’t supposed to watch any frequency other than 1 Hz anyway (at least not as currently implemented in the downstream tools).

Not always. You should switch to a multithreaded or async spinner to make sure a choking callback will not stop diagnostics publishing. Then it starts getting complicated.
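
In roscpp that roughly means something like this (a sketch; the thread count is an arbitrary choice):

#include <ros/ros.h>

// With an AsyncSpinner, callbacks run in a small thread pool, so a single
// blocking callback does not stall the rest (e.g. a diagnostics timer).
int main(int argc, char** argv)
{
  ros::init(argc, argv, "my_node");
  ros::NodeHandle nh;  // subscribers and the diagnostics timer are set up elsewhere

  ros::AsyncSpinner spinner(4);  // 4 worker threads
  spinner.start();
  ros::waitForShutdown();
  return 0;
}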

And to say it again - having the heartbeat published on /diagnostics topic means you need to do string-based parsing/matching to analyze the performance in postprocessing. I don’t consider that a very good option, at least not for things I want to check regularly.

Can you expand on how the “heartbeat” message verifies the sensor is working as expected?

To be concrete, there are multiple levels of perception failures from a LIDAR. We could break this down into sensor interface, sensor data, and perception.

At the sensor interface level, there are errors in communication of the data from the LIDAR to the compute platform running ROS. Dropped ethernet packets. Delays in arrival of the data. Basic errors in the data itself.

For sensor data, is the sensor producing data in spec? Did a HW component fail or report any errors? Are the LIDAR points within spec / valid?

At the perception level, is the sensor blinded? Is a soda can or plastic bag covering the LIDAR? Is the LIDAR casing covered in snow, or grease?

Is the heartbeat a simple boolean capturing all of this diagnostic information for a higher-level, system-wide health monitor? Or is this a simple “I’m still here” message, and the above diagnostic work would still need to be performed?

Thanks.


The heartbeat messages in a lidar driver node, as I see it, can tell the following:

  • Heartbeat is coming
    • Heartbeat is coming at desired frequency and with reasonable delay
      • → Lidar is running, network communication works and the driver is processing the data coming from the lidar
    • Heartbeat is slower than expected or has an unusual delay
      • → There might be performance problems either on the computer or in the network
  • Heartbeat is not coming
    • → Either the driver is not running, there is a network failure, or the sensor is dead

I think heartbeat cannot and should not be trying to capture any kind of semantic functionality of the sensor. That is left for more advanced analyzers.

As you can see from my little “diagram”, the information the heartbeat gives is usually ambiguous. But practically, I know I want to start a robot and see a green checkmark somewhere telling me the lidar is “basically working” - on, connected, communicating, driver running. Even this simple heartbeat check can confirm that a lot of configuration and communication has been done right.

@peci1 thanks for sharing your thoughts :+1:

it requires going through all diagnostics messages and parsing them to find the relevant reports.

Off topic, but I think this can be improved by Content Filtered Topic support, which is a work in progress.

For example with Ouster, UDP data can stop coming from the sensor either because of a network failure or because the sensor is overheated… And then the callbacks stop being called.

I am not really sure why publishers are related to this. I think that whatever happens to the publisher, the RMW implementation will notify the subscription if this requirement is missed.

Heartbeat is coming

Deadline event will not be notified.

Heartbeat is not coming

Deadline event will be notified.

Could you elaborate on what requirement is missing in your use case? I am interested in that.

thanks


@tomoyafujita Thanks for your reply, it brings a lot of insight into the workings of ROS 2. My main objective here, though, is ROS 1. It does not offer any of these RMW goodies.

But staying with your ROS 2 notes, it looks like during runtime, a lot is available to monitor e.g. a driver. I have two questions left:

  1. Can any RMW tool/API be used to only get the rate of publication without actually subscribing? This can be a big deal with RealSense-like sensors which can generate almost a gigabyte per second of pointclouds. As much as possible, I only want to have a single subscriber for this amount of data - some perception node. But I also want to run a separate simple diagnostics node that just checks the data rate (without bloating the perception node too much).
  2. Can any of the deadline events be recorded? Imagine a robot that failed during a mission and you have to investigate the reason. Will you see somewhere that the deadline event was notified? Or do you have to explicitly generate a message on receipt of such an event and store it in the bag yourself?

Just to explain this a little bit more - the update() function will get called many times a second, but the diagnostics publisher will only aggregate whatever happened once a second and publish this aggregate information. It is not possible to figure out from the /diagnostics topic which exact sensor message went missing.

@peci1 thanks for the information!

Can any RMW tool/API be used to only get the rate of publication without actually subscribing?

No, I do not think so.

Can any of the deadline events be recorded?

No, I don’t think so either. I think it is the application’s responsibility to handle it after the event is notified.

I think these statistics, together with the endpoint graph, would be useful for monitoring the distributed system to see if the whole system works okay. Just curious, does anyone know whether DDS supports any statistics API?

Fast DDS provides a Statistics module for exactly this purpose. There is also a GUI application Fast DDS Monitor that can be used to monitor the performance of the DDS network in real time. You may be interested in the tutorial to use it alongside ROS 2 Galactic.

You could use the eCAL RMW and then monitor your system (including frequencies and data sizes) using the eCAL Monitor standard tool.