Thanks everybody for sharing the lots of information. I’d like to sum up all of it here.
The usage I’m most interested in is being able to figure out (both live and from recording) the rate of publication and delay (at publish time) of messages on a single topic that are too big to have multiple subscribers or are too big to be recorded all. I also want to be able to check the rate/delay-related properties of individual published messages (i.e. reading their Header). As this need showed up several times in different places during my professional life, I’d like to converge at something that is easily and generically usable in many places and that has hope for having good support in generic diagnostics tools.
ROS1
- no suitable
Heartbeat
message type is available in “official” ROS repos - runtime checks
- Topic statistics can be used to check frequency/delay of messages
- this is a global option - either all topics or no topics (it can be selectively turned on/off for nodes by changing the value of
/enable_statistics
parameter when launching nodes, but it is impractical as it requires synchronizing the launch sequence) - topic statistics do not provide information about which exact message was missing or late
- all statistics are published to a single topic (
/statistics
) so finding data relevant for a single topic needs matching the topic name against all messages in the topic - it requires no change to existing code, all subscribers implicitly support it
- the statistics are available only when at least one subscriber is present (but it is not necessary to subscribe to the topic from the statistics-analyzing node)
- this is a global option - either all topics or no topics (it can be selectively turned on/off for nodes by changing the value of
-
FrequencyStatus
andTimeStampStatus
diagnostic tasks can be used to check frequency/delay of published messages- setup of these diagnostic tasks can be sometimes simple and sometimes more complicated (switching to AsyncSpinner and so on)
- the tasks do not provide information about which exact message was missing or late
- getting machine-readable information from the tasks requires string-parsing, which is error-prone
- the statistics are provided even when there are no subscribers of the topic
-
Heartbeat
diagnostic task is not suitable as it reports a plain timer-based heartbeat telling that the process of the node is running - Using
Bond
messages would break the ROS principle of using semantically-fitting message types (bond is inherently point-to-point, publishers are point-to-multipoint)
- Topic statistics can be used to check frequency/delay of messages
- recordable (post-mission debug) checks
- rate/delay of recorded topics can be found out pretty easily by playing them back and running rostopic hz, looking at rqt_bag visualization etc.
- non-recorded topic statistics can be found in bag files if topic statistics were enabled and the
/statistics
topic was recorded- to debug a single topic, you need to filter the messages on
/statistics
by topic name - there is no way of figuring out which exact message was missing/late
- statistics are available only for publishers that had at least one subscriber
- on a system with a lot of connections, recording
/statistics
can have noticeable performance impact (observed in SubT Virtual challenge)
- to debug a single topic, you need to filter the messages on
-
FrequencyStatus
andTimeStampStatus
tasks can be recorded and used in playback to give an idea how the publications were working- to get exact frequencies/delays, string-parsing is needed
- there is no way of figuring out which exact message was missing/late
ROS2
- a suitable
Heartbeat
message is defined in software_watchdogs_msgs, which might be considered a good candidate for what I am seeking - runtime checks
- some (all?) DDS implementations have proprietary ways of getting publication rate (also delays?) from a running system without actually subscribing to the large topics
- it seems that these statistics have to be either on or off for the whole publication domain (or did I understand it wrong and can it be configured per-node by passing different env vars to each node?)
- enabling these statistics requires recompiling the DDS with a non-default parameter (not sure whether only on the publishers or also on the monitoring node)
- publishers and subscribers can be notified via QoS settings that the topic stopped meeting the criteria and they can react to this
- GitHub - ros-safety/software_watchdogs: A library of (software) watchdogs based on DDS Quality of Service (QoS) policies and ROS 2 lifecycle nodes. provides a generically-usable lifecycle node that can watch that a process is running (but it cannot monitor deliberate topics)
- Topic Statistics can be enabled on subscribers
- requires source code changes
- there is no way of figuring out which exact message was missing/late
- no way to avoid subscribing large messages just to figure out the rate/delay
-
FrequencyStatus
andTimeStampStatus
diagnostic tasks can be used to check frequency/delay of published messages- setup of these diagnostic tasks can be sometimes simple and sometimes more complicated
- the tasks do not provide information about which exact message was missing or late
- getting machine-readable information from the tasks requires string-parsing, which is error-prone
-
Heartbeat
diagnostic task is not suitable as it reports a plain timer-based heartbeat telling that the process of the node is running
- some (all?) DDS implementations have proprietary ways of getting publication rate (also delays?) from a running system without actually subscribing to the large topics
- recordable (post-mission debug) checks
- no DDS provides any kind of RMW-exposed topic that could be recorded to provide rate/delay information post-mortem
- DDS vendors provide proprietary tools to analyze a running system performance
- rate/delay of recorded topics can be found out pretty easily by playing them back and running ros2 topic hz, looking at rqt_bag visualization etc.
- Topic Statistics can be used to record rate/delay of topics that are not recorded, if
/statistics
topic is recorded- but it is needed that the topics have statistics-enabled subscribers during runtime; moreover, the measured statistics are from the subscriber side, not the publisher side
- to get exact frequencies/delays, matching the topic name against all messages in
/statistics
is needed - there is no way of figuring out which exact message was missing/late
-
FrequencyStatus
andTimeStampStatus
tasks can be recorded and used in playback to give an idea how the publications were working- to get exact frequencies/delays, string-parsing is needed
- there is no way of figuring out which exact message was missing/late
-
ros2_tracing can record a handful of various events which can be later used for analysis of the system performance (callback durations and rates etc.)
- live usage might be possible in the future, as well as integrating DDS analysis
- no DDS provides any kind of RMW-exposed topic that could be recorded to provide rate/delay information post-mortem
My conclusions from the above analysis are:
- ROS 1 needs to add a
Heartbeat
message type, ideally to diagnostic_msgs. It might be just the proposed one (message with justHeader
), or it might be made forward-compatible with the message definition from software_watchdog_msgs in ROS2 (which would come at the cost of not being compatible withHeader
-parsing tools). - ROS 2 already has a good candidate message definition in software_watchdog_msgs, but it might make sense to move the definition to diagnostic_msgs to allow the message being wider spread.
- Neither ROS 1 nor ROS 2 provide a reliable ROS-level (i.e. not DDS-proprietary) mechanism for checking rate/delay of published topics in both runtime and postprocessing (without actually subscribing/recording the possibly large topics).
- It would be interesting to try to integrate the proprietary DDS statistics into RMW, but that’s probably a much broader task.
- ROS 1 topic statistics are almost there - they provide the aggregate rate/delay info when there is at least one subscriber to a topic (which might be unrelated and does not need to exist just for the statistics collection)
I did my best with the analysis to capture the current state, but please correct me if I got it wrong somewhere (there are a few question marks in the ROS 2 part).
Edit 1 (28/02): Added info about Topic Statistics
Edit 2 (29/03): Added info about ros2_tracing