What kind of tools do you use to monitor your robots at runtime? I’m especially interested in monitoring ROS 2 topic rates to confirm, for example, that all the sensors are publishing data correctly.
I’m looking for a solution that could do this without large overhead, and that would preferably work “from outside”, without injecting code into the subscriptions/publishers. It should publish ROS Diagnostics messages if the measured rates fall below the given thresholds.
Is there an existing solution that can do this? So far I’ve checked, for example, the ROS 2 built-in Topic Statistics and ros2_tracing, but they don’t seem to fulfill these requirements.
I would also be interested in hearing how you monitor the other aspects of your robots, such as network usage, Node health, etc.
The DiagnosedPublisher will publish the rate + delay on the /diagnostics topic. We then use the diagnostic aggregator to create a dashboard, so we can quickly see whether everything is “green”.
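Roughly, the publisher-side setup looks something like this; this is only a minimal sketch assuming Humble-era rclcpp/diagnostic_updater APIs, and the topic name, message type and rate limits are placeholders:

```cpp
#include <chrono>
#include <memory>

#include <diagnostic_updater/diagnostic_updater.hpp>
#include <diagnostic_updater/publisher.hpp>
#include <rclcpp/rclcpp.hpp>
#include <sensor_msgs/msg/point_cloud2.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("lidar_driver");

  // Collects diagnostic tasks and publishes them on /diagnostics.
  diagnostic_updater::Updater updater(node);
  updater.setHardwareID("lidar");

  // Acceptable publishing rate window: flag anything outside ~9-11 Hz.
  double min_freq = 9.0;
  double max_freq = 11.0;

  // Wraps the real publisher; every publish() call feeds the frequency and
  // timestamp-delay checks that end up on /diagnostics.
  diagnostic_updater::DiagnosedPublisher<sensor_msgs::msg::PointCloud2> diagnosed_pub(
    node->create_publisher<sensor_msgs::msg::PointCloud2>("points", 10),
    updater,
    diagnostic_updater::FrequencyStatusParam(&min_freq, &max_freq, 0.1, 10),
    diagnostic_updater::TimeStampStatusParam(-1.0, 0.5));

  auto timer = node->create_wall_timer(
    std::chrono::milliseconds(100),
    [&]() {
      sensor_msgs::msg::PointCloud2 msg;
      msg.header.stamp = node->now();
      // ... fill in the actual sensor data ...
      diagnosed_pub.publish(msg);
    });

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}
```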
For general diagnostics we launch some monitors from the diagnostic_common_diagnostics package, such as CPU, RAM, temperature, etc. We also wrote a lot of additional diagnostic nodes that all publish to the /diagnostics topic.
It would be useful if you could share in a bit more detail why the built-in Topic Statistics cannot be used for your use case. I think knowing what’s missing would be good information for mainline development. (By the way, built-in Topic Statistics can only be used with rclcpp, so if we use rclpy we cannot use this option.)
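For context, the built-in Topic Statistics is enabled per subscription through rclcpp::SubscriptionOptions; a minimal sketch, with a placeholder topic and message type:

```cpp
#include <chrono>
#include <memory>

#include <rclcpp/rclcpp.hpp>
#include <sensor_msgs/msg/point_cloud2.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("stats_listener");

  // Topic Statistics is opt-in and configured per subscription.
  rclcpp::SubscriptionOptions options;
  options.topic_stats_options.state = rclcpp::TopicStatisticsState::Enable;
  options.topic_stats_options.publish_period = std::chrono::seconds(1);
  // Statistics go to /statistics unless overridden here.
  options.topic_stats_options.publish_topic = "/statistics";

  auto sub = node->create_subscription<sensor_msgs::msg::PointCloud2>(
    "points", 10,
    [](sensor_msgs::msg::PointCloud2::ConstSharedPtr /*msg*/) {
      // Normal message handling; period/age statistics are collected automatically.
    },
    options);

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}
```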
Thank you all for the replies! These tips have already been really helpful.
@Rayman Thanks for pointing out this DiagnosedPublisher. I didn’t know it existed, and so far it looks like a better approach than some of the ones I’ve found before.
@bmagyar This talk is amazing! I learned a lot, and the Bonsai approach seems to cover most of the requirements. So far it is the most comprehensive approach and also comes with other features, such as Node monitoring.
@tomoyafujita Here is a quick list of the criteria I’m using to compare the possible solutions I know of:

- No code changes required on the publishing / subscribing side*
- Works on all client libraries (rclpy, rclcpp)
- Works with composable nodes
- Publishes Diagnostics messages
- Live monitoring
- Works in Humble

*Requires a minimal change: setting Deadline QoS on publishers
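For reference, the Deadline QoS change mentioned in the footnote looks roughly like the sketch below. This only shows the QoS mechanics (the monitoring subscription here still receives the full message data), and the topic, message type and deadline period are placeholders:

```cpp
#include <chrono>
#include <memory>

#include <rclcpp/rclcpp.hpp>
#include <sensor_msgs/msg/point_cloud2.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("deadline_demo");

  // The driver offers a deadline: a new sample at least every 150 ms.
  rclcpp::QoS qos(10);
  qos.deadline(std::chrono::milliseconds(150));
  auto pub = node->create_publisher<sensor_msgs::msg::PointCloud2>("points", qos);

  // A monitor requesting the same deadline is notified through an event
  // callback whenever the publisher fails to deliver in time.
  rclcpp::SubscriptionOptions sub_options;
  sub_options.event_callbacks.deadline_callback =
    [node](rclcpp::QOSDeadlineRequestedInfo & event) {
      RCLCPP_WARN(
        node->get_logger(), "Deadline missed (total: %d)", event.total_count);
    };

  auto sub = node->create_subscription<sensor_msgs::msg::PointCloud2>(
    "points", qos,
    [](sensor_msgs::msg::PointCloud2::ConstSharedPtr /*msg*/) {},
    sub_options);

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}
```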
One drawback of Topic Statistics is that while it publishes the statistics to the /statistics topic by default, there is no identifier in the message to tell which publisher/subscriber it came from. This means the statistics for each topic would have to be sent to a different topic, and a separate Node would then have to subscribe to all of them to publish the Diagnostics messages.
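If we went down that route, that separate Node could look roughly like the sketch below; the per-topic statistics topic names here are an assumption, and the mapping of a MetricsMessage to a single status is simplified:

```cpp
#include <memory>
#include <string>
#include <vector>

#include <diagnostic_msgs/msg/diagnostic_array.hpp>
#include <rclcpp/rclcpp.hpp>
#include <statistics_msgs/msg/metrics_message.hpp>
#include <statistics_msgs/msg/statistic_data_type.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("statistics_to_diagnostics");
  auto diag_pub =
    node->create_publisher<diagnostic_msgs::msg::DiagnosticArray>("/diagnostics", 10);

  // One statistics topic per monitored subscription, e.g. the lidar
  // subscription configured to publish its statistics to /statistics/points.
  const std::vector<std::string> stats_topics = {"/statistics/points", "/statistics/imu"};

  std::vector<rclcpp::Subscription<statistics_msgs::msg::MetricsMessage>::SharedPtr> subs;
  for (const auto & topic : stats_topics) {
    subs.push_back(
      node->create_subscription<statistics_msgs::msg::MetricsMessage>(
        topic, 10,
        [node, diag_pub, topic](statistics_msgs::msg::MetricsMessage::ConstSharedPtr msg) {
          diagnostic_msgs::msg::DiagnosticStatus status;
          // metrics_source is e.g. "message_period" or "message_age".
          status.name = topic + ": " + msg->metrics_source;
          status.hardware_id = msg->measurement_source_name;
          status.level = diagnostic_msgs::msg::DiagnosticStatus::OK;
          for (const auto & point : msg->statistics) {
            if (point.data_type ==
              statistics_msgs::msg::StatisticDataType::STATISTICS_DATA_TYPE_AVERAGE)
            {
              diagnostic_msgs::msg::KeyValue kv;
              kv.key = "average [" + msg->unit + "]";
              kv.value = std::to_string(point.data);
              status.values.push_back(kv);
              // Threshold checks against the expected rate would go here.
            }
          }
          diagnostic_msgs::msg::DiagnosticArray array;
          array.header.stamp = node->now();
          array.status.push_back(status);
          diag_pub->publish(array);
        }));
  }

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}
```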
The ros2_tracing package looks on paper like a suitable approach. However, I didn’t even manage to get it running on Humble by following the documentation and tutorials. The documentation has room for improvement, and the ROS 2 tutorial page seems to have some errors in the guide. On Humble, the package also requires a huge number of dependencies to be built from source. I got it running on Jazzy, but even after looking through the documentation and examples a lot, I couldn’t find out whether it supports live monitoring. In all the examples I found, the data is always logged to disk and then analyzed offline afterwards.
I’ll continue my search by looking more in-depth into the RMW Stats Shim + Graph Monitor approach from Bonsai! Thank you everyone so far.
We have a related approach that isn’t mentioned very often elsewhere, but I still find it quite useful (ROS 1 only so far, but it’s just one message definition and a topic naming guideline).
For drivers publishing large data (lidar scans, images), we augment the publisher to also publish a Heartbeat message along with each large message (on a subtopic like /points/heartbeat). This gives you a very lightweight way of figuring out exactly which data were delayed or missing, even if you’re not recording the large messages themselves. The Heartbeat message carries the same Header as the large message, so things like topic delay and rate can be calculated exactly. You can even subscribe to the heartbeat topic over Wi-Fi without stressing the wireless link.
The added value over the standard diagnostics framework is that you know the exact timestamp of the message that went wrong, whereas diagnostics only gives you aggregates over a 1-sec interval.
Also, once the publisher publishes the Heartbeat messages, it is actually quite easy to add an external diagnostics node to compute message publishing statistics, without paying the large cost of serializing and sending the large messages.
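In ROS 2 terms, the driver-side change would be roughly the following sketch; std_msgs/Header stands in for our custom Heartbeat message here, since that message and the /heartbeat topic naming are just our own convention, not something from a released package:

```cpp
#include <chrono>
#include <memory>

#include <rclcpp/rclcpp.hpp>
#include <sensor_msgs/msg/point_cloud2.hpp>
#include <std_msgs/msg/header.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("lidar_driver");

  auto points_pub =
    node->create_publisher<sensor_msgs::msg::PointCloud2>("points", 10);
  // Lightweight companion topic, following the <topic>/heartbeat naming convention.
  auto heartbeat_pub =
    node->create_publisher<std_msgs::msg::Header>("points/heartbeat", 10);

  auto timer = node->create_wall_timer(
    std::chrono::milliseconds(100),
    [&]() {
      sensor_msgs::msg::PointCloud2 cloud;
      cloud.header.stamp = node->now();
      cloud.header.frame_id = "lidar";
      // ... fill in the actual point data ...
      points_pub->publish(cloud);

      // Same Header as the large message, so delay and rate can be computed
      // exactly by anyone subscribing only to the tiny heartbeat topic.
      heartbeat_pub->publish(cloud.header);
    });

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}
```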
Interesting approach! This is a better solution than the similar approach that I was considering in the beginning.
My first instinct was to create a Node that would subscribe to the existing topics, calculate the rates, and publish Diagnostics messages based on these readings. However, the problem was that for large data there would be a huge overhead, and the diagnostics would affect the system’s performance.
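For illustration, that first-instinct monitor could be sketched like this, using a generic subscription so that at least deserialization is skipped; the data still has to be delivered to the monitoring process, which is exactly the overhead problem (topic name, type and expected rate are placeholders):

```cpp
#include <chrono>
#include <memory>
#include <string>

#include <diagnostic_msgs/msg/diagnostic_array.hpp>
#include <rclcpp/rclcpp.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("topic_rate_monitor");

  const std::string topic = "points";
  const double expected_hz = 10.0;

  // Generic subscription: counts messages without deserializing them. The
  // data is still transported to this process, which is the overhead concern.
  auto count = std::make_shared<size_t>(0);
  auto sub = node->create_generic_subscription(
    topic, "sensor_msgs/msg/PointCloud2", rclcpp::SensorDataQoS(),
    [count](std::shared_ptr<rclcpp::SerializedMessage> /*msg*/) {++(*count);});

  auto diag_pub =
    node->create_publisher<diagnostic_msgs::msg::DiagnosticArray>("/diagnostics", 10);

  // Once per second, compare the observed rate against the expected one.
  auto timer = node->create_wall_timer(
    std::chrono::seconds(1),
    [&]() {
      const double rate = static_cast<double>(*count);
      *count = 0;

      diagnostic_msgs::msg::DiagnosticStatus status;
      status.name = "rate " + topic;
      status.level = rate < 0.9 * expected_hz ?
        diagnostic_msgs::msg::DiagnosticStatus::ERROR :
        diagnostic_msgs::msg::DiagnosticStatus::OK;
      status.message = std::to_string(rate) + " Hz";

      diagnostic_msgs::msg::DiagnosticArray array;
      array.header.stamp = node->now();
      array.status.push_back(status);
      diag_pub->publish(array);
    });

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}
```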
To solve this, I considered using the ROS 2 feature “Content Filtering Subscription”, as I had (incorrectly) understood that it could receive only parts of a message: the idea was to filter out all the message contents and only trigger the callback, without the actual data. However, I quickly learned that a content filtering subscription can only be used to decide whether the whole message is received or not; it can’t be used to receive only part of a message’s contents.
I like your message heartbeat approach from the point of view that it would be a lightweight way to calculate the topic rates externally, without a huge overhead. One drawback is that it requires changes to the source code on the driver side. For example, to get all the sensors’ topic rates, we would have to fork the ROS 2 driver repositories and add these changes there. It would be great if this could be avoided and the topic rates could be monitored fully externally, for example at the middleware layer.
Yes, I’m aware of this drawback of the solution. On the other hand, it’s just one publisher and one simple message that need to be added. But I agree that forking the official drivers is something you want to avoid for as long as possible.
What you could do is create a composable node that would create the Heartbeat message just by subscribing to the topic. That would still incur some performance penalty, but it could be rather low, and you would pay it just once (while you could use the heartbeats for online diagnostics, bag recording and possibly other stuff). But that excludes Python drivers, and I’m not sure how large a part of the official drivers can be run efficiently as composable nodes.
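Sketched out, such a relay component could look roughly like this; it only really pays off when the container has intra-process communication enabled, and all the names here are placeholders:

```cpp
#include <rclcpp/rclcpp.hpp>
#include <rclcpp_components/register_node_macro.hpp>
#include <sensor_msgs/msg/point_cloud2.hpp>
#include <std_msgs/msg/header.hpp>

namespace heartbeat_relay
{

// Composable node: loaded into the same container as the driver (with
// intra-process communication enabled), it can receive the large message
// without a serialization round trip and republish only its Header.
class HeartbeatRelay : public rclcpp::Node
{
public:
  explicit HeartbeatRelay(const rclcpp::NodeOptions & options)
  : rclcpp::Node("heartbeat_relay", options)
  {
    heartbeat_pub_ = create_publisher<std_msgs::msg::Header>("points/heartbeat", 10);
    sub_ = create_subscription<sensor_msgs::msg::PointCloud2>(
      "points", rclcpp::SensorDataQoS(),
      [this](sensor_msgs::msg::PointCloud2::ConstSharedPtr msg) {
        heartbeat_pub_->publish(msg->header);
      });
  }

private:
  rclcpp::Publisher<std_msgs::msg::Header>::SharedPtr heartbeat_pub_;
  rclcpp::Subscription<sensor_msgs::msg::PointCloud2>::SharedPtr sub_;
};

}  // namespace heartbeat_relay

RCLCPP_COMPONENTS_REGISTER_NODE(heartbeat_relay::HeartbeatRelay)
```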
Please open an issue and mention what doesn’t work for you or what you think needs improvement, otherwise we don’t know about it and can’t improve whatever doesn’t work!
I actually talked to @emersonknapp after ROSCon about this and shared a simple starter script to do live trace data processing. I included it in the issue above. This will require you to implement your own (live) trace data analysis in Python, but, depending on your use-case, it might be simpler than you think. I believe it could be quite powerful. If you’d like to go down this road, let me know and I can try to help you.
Although its primary purpose is to keep track of the alive status of an entire process (such as a standalone node, or a container of composable nodes), you can use the DDS heartbeat feature to learn whether a topic is publishing or not by setting up a heartbeat that is updated for the topic of interest. It does require modifying the publishing nodes, though.
Open-RMF uses heartbeats, so you can see how they’re done here.
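Assuming this maps to the Liveliness QoS policy that ROS 2 exposes, the mechanics look roughly like the sketch below; the topic, message type and lease duration are placeholders:

```cpp
#include <chrono>
#include <memory>

#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/empty.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("liveliness_demo");

  // The publisher promises to assert its liveliness at least once per second.
  rclcpp::QoS qos(1);
  qos.liveliness(RMW_QOS_POLICY_LIVELINESS_MANUAL_BY_TOPIC);
  qos.liveliness_lease_duration(std::chrono::seconds(1));
  auto pub = node->create_publisher<std_msgs::msg::Empty>("heartbeat", qos);

  // Either publishing or calling assert_liveliness() keeps the topic "alive".
  auto timer = node->create_wall_timer(
    std::chrono::milliseconds(500), [pub]() {pub->assert_liveliness();});

  // A monitoring subscription is notified whenever the publisher's liveliness
  // is lost or regained.
  rclcpp::SubscriptionOptions sub_options;
  sub_options.event_callbacks.liveliness_callback =
    [node](rclcpp::QOSLivelinessChangedInfo & event) {
      RCLCPP_WARN(
        node->get_logger(), "Alive publishers: %d (change: %d)",
        event.alive_count, event.alive_count_change);
    };
  auto sub = node->create_subscription<std_msgs::msg::Empty>(
    "heartbeat", qos, [](std_msgs::msg::Empty::ConstSharedPtr) {}, sub_options);

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}
```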
@gbiggs Interesting approach. Thanks for linking it, I’ll check it out.
@Jaime_Martin_Losa Thank you for mentioning ROS2 Monitor! I have a couple of questions related to it.
Can this data be obtained from Python or C++? I’d like to have Diagnostics messages published based on the measured data.
The Statistics Module documentation mentions that “… by default, Fast DDS does not compile this module because it may entail affecting the application’s performance.” Is there any information available on how much overhead the ROS2 Monitor causes?
@christophebedard First of all, thank you for your contributions to ros2_tracing! Right now I still see ros2_tracing as one of the most promising solutions for monitoring the topic rates.
Please open an issue and mention what doesn’t work for you or what you think needs improvement, otherwise we don’t know about it and can’t improve whatever doesn’t work!
Sure, I’ll do this. On a general level, I would find it helpful to have a beginner-friendly introduction to tracing in the ros2_tracing repository and in the ROS 2 documentation, for people who haven’t worked with LTTng or tracing before. Just to explain a bit more, on a general level, what tracing is, what can be traced, how the tracing happens (offline into a file, live, etc.), the methods of tracing (from Python / C++ code (?), CLI, etc.), and, for example, the most common use cases that tracing can help with.
Awesome, thanks a lot for posting this example! I see this as a viable option, so I’d like to try it out. I’ll be in touch with you to discuss this further.
Thank you for your response and the helpful information regarding the data acquisition. I appreciate the link to the Statistics Module documentation. However, could you please clarify what specific performance impacts to expect from using the ROS2 Monitor? Understanding the potential overhead will greatly assist in determining if this approach aligns with our application’s requirements.
I greatly appreciate this thread! I’m wondering about folks’ thoughts on diagnostics vs. runtime-necessary messaging. In other words: do you use/need diagnostics in order to operate your robot, or are they indeed only for troubleshooting? If the latter, do you turn them off during nominal operations?
We run monitoring of available nodes and nominal topic rates all the time, and it is mandatory for us.
People may operate next to our machines; therefore we need to fulfill a certain standard that I can’t remember offhand. But the standard comes down to this: the machine must self-diagnose all the time, and cease operation as soon as it detects something that is off.
This is the distinction I’m curious about. So does monitoring the topic (say at the RMW level) guarantee that the node/business logic subscribing to said topic is actually doing what it’s supposed to at the rate it’s supposed to do it?
No, it only works the other way around: if you have a driver that is supposed to publish at 30 msgs/second and it is not, you know for sure that it is broken.
This, however, does not give you any guarantee that it’s actually working correctly.
Therefore this is an easy and fast way to implement basic error detection, but nothing more.