Add Heartbeat message type

Thanks everybody for sharing the lots of information. I’d like to sum up all of it here.

The usage I’m most interested in is being able to figure out (both live and from recording) the rate of publication and delay (at publish time) of messages on a single topic that are too big to have multiple subscribers or are too big to be recorded all. I also want to be able to check the rate/delay-related properties of individual published messages (i.e. reading their Header). As this need showed up several times in different places during my professional life, I’d like to converge at something that is easily and generically usable in many places and that has hope for having good support in generic diagnostics tools.

ROS1

  • no suitable Heartbeat message type is available in “official” ROS repos
  • runtime checks
    • Topic statistics can be used to check frequency/delay of messages
      • this is a global option - either all topics or no topics (it can be selectively turned on/off for nodes by changing the value of /enable_statistics parameter when launching nodes, but it is impractical as it requires synchronizing the launch sequence)
      • topic statistics do not provide information about which exact message was missing or late
      • all statistics are published to a single topic (/statistics) so finding data relevant for a single topic needs matching the topic name against all messages in the topic
      • it requires no change to existing code, all subscribers implicitly support it
      • the statistics are available only when at least one subscriber is present (but it is not necessary to subscribe to the topic from the statistics-analyzing node)
    • FrequencyStatus and TimeStampStatus diagnostic tasks can be used to check frequency/delay of published messages
      • setup of these diagnostic tasks can be sometimes simple and sometimes more complicated (switching to AsyncSpinner and so on)
      • the tasks do not provide information about which exact message was missing or late
      • getting machine-readable information from the tasks requires string-parsing, which is error-prone
      • the statistics are provided even when there are no subscribers of the topic
    • Heartbeat diagnostic task is not suitable as it reports a plain timer-based heartbeat telling that the process of the node is running
    • Using Bond messages would break the ROS principle of using semantically-fitting message types (bond is inherently point-to-point, publishers are point-to-multipoint)
  • recordable (post-mission debug) checks
    • rate/delay of recorded topics can be found out pretty easily by playing them back and running rostopic hz, looking at rqt_bag visualization etc.
    • non-recorded topic statistics can be found in bag files if topic statistics were enabled and the /statistics topic was recorded
      • to debug a single topic, you need to filter the messages on /statistics by topic name
      • there is no way of figuring out which exact message was missing/late
      • statistics are available only for publishers that had at least one subscriber
      • on a system with a lot of connections, recording /statistics can have noticeable performance impact (observed in SubT Virtual challenge)
    • FrequencyStatus and TimeStampStatus tasks can be recorded and used in playback to give an idea how the publications were working
      • to get exact frequencies/delays, string-parsing is needed
      • there is no way of figuring out which exact message was missing/late

ROS2

  • a suitable Heartbeat message is defined in software_watchdogs_msgs, which might be considered a good candidate for what I am seeking
  • runtime checks
    • some (all?) DDS implementations have proprietary ways of getting publication rate (also delays?) from a running system without actually subscribing to the large topics
      • it seems that these statistics have to be either on or off for the whole publication domain (or did I understand it wrong and can it be configured per-node by passing different env vars to each node?)
      • enabling these statistics requires recompiling the DDS with a non-default parameter (not sure whether only on the publishers or also on the monitoring node)
    • publishers and subscribers can be notified via QoS settings that the topic stopped meeting the criteria and they can react to this
    • GitHub - ros-safety/software_watchdogs: A library of (software) watchdogs based on DDS Quality of Service (QoS) policies and ROS 2 lifecycle nodes. provides a generically-usable lifecycle node that can watch that a process is running (but it cannot monitor deliberate topics)
    • Topic Statistics can be enabled on subscribers
      • requires source code changes
      • there is no way of figuring out which exact message was missing/late
      • no way to avoid subscribing large messages just to figure out the rate/delay
    • FrequencyStatus and TimeStampStatus diagnostic tasks can be used to check frequency/delay of published messages
      • setup of these diagnostic tasks can be sometimes simple and sometimes more complicated
      • the tasks do not provide information about which exact message was missing or late
      • getting machine-readable information from the tasks requires string-parsing, which is error-prone
    • Heartbeat diagnostic task is not suitable as it reports a plain timer-based heartbeat telling that the process of the node is running
  • recordable (post-mission debug) checks
    • no DDS provides any kind of RMW-exposed topic that could be recorded to provide rate/delay information post-mortem
      • DDS vendors provide proprietary tools to analyze a running system performance
    • rate/delay of recorded topics can be found out pretty easily by playing them back and running ros2 topic hz, looking at rqt_bag visualization etc.
    • Topic Statistics can be used to record rate/delay of topics that are not recorded, if /statistics topic is recorded
      • but it is needed that the topics have statistics-enabled subscribers during runtime; moreover, the measured statistics are from the subscriber side, not the publisher side
      • to get exact frequencies/delays, matching the topic name against all messages in /statistics is needed
      • there is no way of figuring out which exact message was missing/late
    • FrequencyStatus and TimeStampStatus tasks can be recorded and used in playback to give an idea how the publications were working
      • to get exact frequencies/delays, string-parsing is needed
      • there is no way of figuring out which exact message was missing/late
    • ros2_tracing can record a handful of various events which can be later used for analysis of the system performance (callback durations and rates etc.)
      • live usage might be possible in the future, as well as integrating DDS analysis

My conclusions from the above analysis are:

  • ROS 1 needs to add a Heartbeat message type, ideally to diagnostic_msgs. It might be just the proposed one (message with just Header), or it might be made forward-compatible with the message definition from software_watchdog_msgs in ROS2 (which would come at the cost of not being compatible with Header-parsing tools).
  • ROS 2 already has a good candidate message definition in software_watchdog_msgs, but it might make sense to move the definition to diagnostic_msgs to allow the message being wider spread.
  • Neither ROS 1 nor ROS 2 provide a reliable ROS-level (i.e. not DDS-proprietary) mechanism for checking rate/delay of published topics in both runtime and postprocessing (without actually subscribing/recording the possibly large topics).
    • It would be interesting to try to integrate the proprietary DDS statistics into RMW, but that’s probably a much broader task.
    • ROS 1 topic statistics are almost there - they provide the aggregate rate/delay info when there is at least one subscriber to a topic (which might be unrelated and does not need to exist just for the statistics collection)

I did my best with the analysis to capture the current state, but please correct me if I got it wrong somewhere (there are a few question marks in the ROS 2 part).


Edit 1 (28/02): Added info about Topic Statistics
Edit 2 (29/03): Added info about ros2_tracing

4 Likes

I haven’t been able to follow this thread completely, but this stood out to me:

Depending on what you want/need out of this, there is a feature in both ROS 1 and ROS 2 called “topic statistics” which can be enable via a parameter and causes the topics to collect stats and publish them on a separate topic. This is designed to let you introspect things like bandwidth and frequency of a topic without subscribing to it directly. See:

https://docs.ros.org/en/galactic/Concepts/About-Topic-Statistics.html
https://docs.ros.org/en/galactic/Tutorials/Topics/Topic-Statistics-Tutorial.html
http://wiki.ros.org/Topics#Topic_statistics

These stats might not have what you want/need atm, but they could easily be extended to include stuff like what you need, I think.

Sorry if someone already mentioned this idea.

2 Likes

Ahh, I completely forgot about topic statistics! Thanks for bringing it here. I’ll add it to the summary.

In both ROS 1 and ROS 2, they provide only aggregated statistical info, so it is not possible to figure out from a bag file which exact message was delayed or missing. They provide a somewhat nicer interface than the FrequencyStatus and TimeStampStatus diagnostic tasks (no string-parsing is required, just topic name matching), but it still merges the stats from all topics to a single topic, which means more work in processing the data.

In ROS 1, they have to be enabled globally. At least from SubT virtual challenge, I remember running with topic statistics enabled and recording their output had a noticeable performance impact (but maybe that was because SubT virtual was a great stress test overall, running 9 robots partly on one computer).

In ROS 2, they are enabled subscriber-wise, which does not meet the requirement of being able to get the rate/delay without actually subscribing.

In ROS 2, it would be nice to be able to enable topic statistics for a publisher. That would solve a part of this problem.

An ideal solution for both ROS 1 and ROS 2 would be the ability to “subscribe to metadata of a topic”. Explained in terms of HTTP, ROS now allows GET and POST requests, but not HEAD requests. That would be exactly what I’d want - subscribe to a topic, but get just the Header part of each message. As the change in communication protocol needed for this would be huge (probably not doable for ROS 1 at all), I want to circumvent it by introducing this “metadata” topic explicitly using the Heartbeat message type.

1 Like

I know this is a very different use-case and it only applies to ROS 2, but if you want low overhead, you can also use ros2_tracing to record execution data. However, the existing data processing tools currently only really support postprocessing (supporting runtime processing wouldn’t be impossible, but it’s a whole other thing). Also, the processing part is of course not as straightforward as subscribing to a statistics topic, but tracetools_analysis makes it pretty easy.

1 Like

ros2_tracing looks definitely interesting! To be usable for live diagnostics, it is still a lot of work to be done, but I imagine a lot of companies will be interested in this =) And if the DDS instrumentation connectors would be implemented, the value of this package would grow even larger. I just haven’t understood from the docs whether it is C+±only or if it works also with rclpy and other non-c++ clients (I see there are rmw-level tracepoints, but the client library tracing seems also important).

We currently only support rclcpp, although I think supporting rclpy would be pretty straightforward, since much of the instrumentation is shared (rcl and below). Feel free to open an issue on the repo to ask more in-depth questions if you want!

Thanks everybody for the fruitful discussion! I’ve created a ROS 1 pull request to add the Heartbeat message: Add diagnostic_msgs/Heartbeat message by peci1 · Pull Request #179 · ros/common_msgs · GitHub .

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.