We are working on an open-source ROS 2 command-line tool to collect and view all relevant information about ROS topics in a single place. The inspiration for the project came from topic delivery delays we experienced on a teleoperated mobile robot, which resulted in undesired behavior.
From this, we thought it would be useful to collect metrics for critical topics, such as latency, bandwidth, and last message time, and display them in a dashboard. These metrics would be essential for the reliable operation of a distributed robotic system, or simply for checking its health.
The primary idea for the tool is to take a snapshot of a “working system”, derive the nominal ranges for metrics, and set rules/alarms automatically.
For example:
We capture a 10-second snapshot of a fully operational system at development time.
The tool generates rules, such as notifying us if the latency of the “/scan” topic deviates by more than 1% from its nominal value.
The user can add/drop rules or change configurations from the UI.
The user runs the tool in the operational environment.
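To make the workflow above more concrete, here is a minimal sketch of the "snapshot to rules" step in plain Python. It is only an illustration, not the tool's actual implementation: the names `Rule`, `derive_rules`, and `check`, the snapshot layout, and the choice of "mean plus or minus a relative tolerance" as the nominal range are all assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Rule:
    """Hypothetical rule: a metric of a topic must stay within [low, high]."""
    topic: str
    metric: str
    low: float
    high: float

    def violated_by(self, value: float) -> bool:
        return not (self.low <= value <= self.high)


def derive_rules(snapshot, tolerance=0.01):
    """Derive one rule per (topic, metric) from a snapshot.

    `snapshot` is assumed to look like {topic: {metric: [samples, ...]}}.
    The nominal value is the mean of the samples, and the allowed range is
    nominal +/- tolerance (1% by default, as in the "/scan" example).
    """
    rules = []
    for topic, metrics in snapshot.items():
        for metric, samples in metrics.items():
            if not samples:
                continue  # nothing was observed, so no rule can be derived
            nominal = mean(samples)
            margin = abs(nominal) * tolerance
            rules.append(Rule(topic, metric, nominal - margin, nominal + margin))
    return rules


def check(rules, topic, metric, value):
    """Return the rules that a newly measured value violates."""
    return [
        r for r in rules
        if r.topic == topic and r.metric == metric and r.violated_by(value)
    ]


if __name__ == "__main__":
    # Snapshot taken on the "working system" (made-up numbers).
    snapshot = {"/scan": {"latency_ms": [10.0, 10.0, 10.0]}}
    rules = derive_rules(snapshot)
    # Later, in the operational environment:
    for violated in check(rules, "/scan", "latency_ms", 10.5):
        print(f"ALERT: {violated.topic} {violated.metric} out of range")
```

A real tool would probably want something more robust than the mean (for example percentiles), since a 10-second snapshot can contain outliers, and it would let the user edit or drop the generated rules from the UI.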
We would like to hear from the community: do you know of similar tools, do you use them, or would you find one useful? Any other feedback would be very welcome.
See this link for a summary of past efforts that somewhat relate to what you’re trying to do.
For a more contemporary and active project, have a look at RobotPerf (source code). You could consider adding new metrics, extending one of the existing benchmarks to meet your use case, or contributing a new benchmark customized for your needs.
I believe that observability is really important for a distributed system like ROS 2.
You may want to look at the existing work on Topic Statistics, which is already available and could be extended.
Currently, topic statistics are only enabled with rclcpp, so if the application uses rclpy or any other client library, there will be no statistics. Integrating the statistics function into rcl, so that any client library can take advantage of it, could probably be the first step.
In addition, off the top of my head:
Support a proxy design based on a local statistics agent. (Currently, each endpoint just sends all of its statistics to the global topic.) This agent proxy would be responsible for collecting statistics on the localhost, caching them, and sending them to the global statistics topic. This could be robust even if the network is offline, since the agent stores the statistics in its cache and can even send a chunk of statistics at once. All nodes running on the localhost can take advantage of inter-process communication transport if available. This would be an option that the user can configure, not the default behavior. It can also be considered a bridge to cloud infrastructure for collecting the statistics.
Statistics API support for user applications. A real-time application needs low-latency access to the statistics so that it can change the settings on the node. In addition to monitoring from a global bird's-eye view, it would be helpful to know the statistics on the endpoint, so that the application itself can get the statistics and change its configuration right away.
Alert support based on statistics. It probably would not be good enough for the user application just to know the statistics; I would want to get notified with events such as “Hey, the topic rate is not as expected or has gone down!” or “This service started failing!”. I think this goes beyond monitoring, but it would probably help user applications.
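As a rough illustration of the last item (alerts when a topic rate drops), here is a small sliding-window rate monitor in plain Python. This is only a sketch under assumptions: `RateMonitor`, its parameters, and the `on_alert` callback are hypothetical names, not part of the existing topic statistics feature. Timestamps are passed in explicitly so the logic stays independent of any client library; in a node they would come from a subscription callback.

```python
from collections import deque
from typing import Callable, Deque


class RateMonitor:
    """Hypothetical helper: alert when a topic's rate falls below `min_hz`."""

    def __init__(self, topic: str, min_hz: float, window_s: float,
                 on_alert: Callable[[str, float], None]) -> None:
        self.topic = topic
        self.min_hz = min_hz
        self.window_s = window_s
        self.on_alert = on_alert
        self._stamps: Deque[float] = deque()

    def record(self, stamp: float) -> None:
        """Call once per received message with its arrival time in seconds."""
        self._stamps.append(stamp)

    def rate(self, now: float) -> float:
        """Average rate in Hz over the last `window_s` seconds."""
        # Drop messages that have fallen out of the sliding window.
        while self._stamps and self._stamps[0] < now - self.window_s:
            self._stamps.popleft()
        return len(self._stamps) / self.window_s

    def check(self, now: float) -> bool:
        """Invoke `on_alert(topic, hz)` if the rate is too low; return True if alerted."""
        hz = self.rate(now)
        if hz < self.min_hz:
            self.on_alert(self.topic, hz)
            return True
        return False
```

Using a callback keeps the alert delivery open: the application could log it, publish it on a topic, or change its own configuration, which also touches on the "statistics API for user applications" item.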
We’ve done the kind of warnings you mention using diagnostics. You have to insert them explicitly, but it’s low effort. This is the simplest approach.
An alternative would be to use tracing. If you want higher temporal resolution than diagnostics while still keeping very low overhead (important if you want to keep this running in production), this is your tool. It can also integrate with other metrics easily, which you might want to use to understand why something happened.
At the moment, the built-in tracer, ros2_tracing, only supports after-run analysis. However, the underlying LTTng toolkit can, in principle, provide live data. I, or maybe also @christophebedard, would be happy to support efforts to integrate an analyzer for this live approach.