Additional levels in DiagnosticStatus

Hello. I’m interested in knowing whether ROS users would find value in adding additional levels to the diagnostic_msgs/DiagnosticStatus message type.

I work on systems with large amounts of ROS diagnostics, and it can often be hard to convey to end users what to focus on when issues occur. In the ROS diagnostics world you can create custom analyzer plugins (and that’s something I’m working on), but I still find the OK, WARN and ERROR levels a bit limiting.

Taking some inspiration from OpenCyphal’s Severity level message I think the following could be useful.

byte INFO=0       # Purely informational
byte OK=1         # Component's diagnostic is in an OK state
byte NOTICE=2     # Level at which user awareness might be recommended but action is not necessarily required
byte WARN=3       # Begin to bring awareness to users, as there might be an issue
byte ERROR=4      # An error condition has been detected
byte CRITICAL=5   # Failure is imminent 
byte ALERT=6      # User attention is required
byte STALE=7      # Reserved for use by the aggregator 

The distinction between INFO and OK would be that in some cases we have diagnostics that just report some values because it’s convenient, but the level is never expected to change, as opposed to a diagnostic reporting OK which might not always be OK.

The distinction between ERROR and CRITICAL would be the operational context. For example, we do a lot of configuration checksum validation at startup. If there’s a mismatch with what we expect, that would be an ERROR. An example of a CRITICAL level might be a critically low battery level. This is not necessarily an error; it is a state of the battery (a BMS, on the other hand, could report errors and that could be an ERROR level diagnostic).

An example of ALERT might be the use of an emergency stop button. Its use is not necessarily an error, but as it is related to safety we need to report it at the highest level possible.
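To make the distinctions concrete, here is a minimal Python sketch of how a reporter might choose between some of these levels if they existed. The constants below simply mirror the proposal and are hypothetical (diagnostic_msgs today only defines OK, WARN, ERROR and STALE), the thresholds are made up, and ROS 1 Python bindings are assumed, where the byte level field maps to an int:

# Hypothetical level constants mirroring the proposal above; not part of diagnostic_msgs.
from diagnostic_msgs.msg import DiagnosticStatus, KeyValue

INFO, OK, NOTICE, WARN, ERROR, CRITICAL, ALERT, STALE = range(8)

def battery_status(voltage):
    """Pick a level for a battery diagnostic using the proposed semantics."""
    status = DiagnosticStatus(name="battery", hardware_id="bms0")
    status.values = [KeyValue(key="voltage", value=str(voltage))]
    if voltage < 10.5:
        # Not an error as such, but failure is imminent (illustrative threshold).
        status.level, status.message = CRITICAL, "Battery critically low"
    elif voltage < 11.5:
        status.level, status.message = WARN, "Battery low"
    else:
        status.level, status.message = OK, "Battery OK"
    return status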

Regardless of specific states, I think more granularity could be helpful.


I have put this on the PMC agenda to discuss today (in 30 minutes, sorry for the late notice!). Will post notes here.


Any chance you can share an update regarding this from that meeting for those who were not able to attend?

There was some brief discussion, but I suggested that anyone in the meeting who had interest come and post over here on discourse so that the conversation would happen in a place the community could get involved in.

On top of what @ct2034 said here: Additional levels in DiagnosticStatus · Issue #268 · ros2/common_interfaces · GitHub

One (implementation/migration) note: the current message (DiagnosticStatus — diagnostic_msgs 5.3.5 documentation) has 4 possible values; adding more values between OK and WARN or between WARN and ERROR would shift all constants after them and potentially invalidate previously recorded bag data.

Another point brought up was that the current aggregation logic depends on the current ordering, so altering the ordering would also impact that.
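To make the ordering dependence concrete, here is a simplified Python sketch of the kind of roll-up the aggregator performs (the real diagnostic_aggregator is a C++ plugin system and treats STALE specially; this only illustrates why the numeric values carry meaning):

from diagnostic_msgs.msg import DiagnosticStatus

def rollup_level(statuses):
    # Simplified roll-up: the group's level is the worst (numerically highest)
    # child level. Inserting new constants between existing ones, or reordering
    # them, silently changes what "worst" means for old bags and old nodes.
    return max((s.level for s in statuses), default=DiagnosticStatus.OK)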

We’ve typically operated under the assumption that WARN is a problem you can keep operating with, but perhaps no longer nominally, while ERROR is an issue you can’t continue operating with. I think ALERT is somewhat out of scope here. Whether or not user attention is required depends greatly on the system’s architecture.

I agree that in the context of how you define WARN and ERROR, an ALERT might not be relevant.

How do you handle fault muting/suppression in your systems? Do you use custom analyzers? When referring to having more granularity in ROS diagnostics, I’m also referring to the aggregated state (multiple diagnostics in the context of each other).

“Alert” is one of the ones that I am particularly interested in because I am looking for the distinction of something that needs attention from the user in order to continue operating vs something that needs the attention from a maintainer or developer to fix (E-stop vs a hardware failure). I think having that extra level is a benefit, and not everyone would have to use all of the levels.

Having these extra levels would definitely benefit better filtering and architecting the system.

ERROR, CRITICAL and ALERT in particular are interesting for different use cases, as we can have errors happening on one level but still continue with another task on the robot.

Regarding the regression, a solution might be to move everything up by one decimal place, i.e. 10, 20, 30, … This way one could have both old and new implementations, and also room to add new states between the existing ones in the future.
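A sketch of what that spacing could look like (values purely illustrative, including where STALE would land):

OK = 10
WARN = 20
ERROR = 30
STALE = 40
# e.g. a future NOTICE could slot in at 15 and a CRITICAL at 35
# without changing the relative order of the existing constants.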


We’ve traditionally separated diagnostics from operations, particularly because there can end up being a lot of noise and our pipe to the robot is tiny (acoustic modems), so we have a custom setup for fault reporting and tend to go more the “idiot light” / “check engine” route. The ROS-provided diagnostics end up being used for debugging after the fact. The fact that the key/values are not strongly typed and carry string messages means we have to do a lot of due diligence to ensure messages are consistent across different nodes created by different developers. As the operator of a robot with a 10 kbps pipe, all I care about is whether the robot can continue its mission or not.

On our systems we have a lot more freedom in what we can convey to users. For better or worse we’ve gone in the opposite direction of a “check engine light” and show users a lot of our diagnostics with a thin user-facing layer. This does help us describe to users what to do when issues occur.

But like you said, there’s a lot of noise and users see symptoms of problems. This is why having some additional levels would be beneficial in how we describe issues.

I agree that there are many use cases for more states. It is just that the states and especially their numerical values also imply semantics, because the aggregator aggregates them in a way where the higher numerical value is interpreted as ‘more important’ than the lower one. And with that, I fear that it could be hard to come up with a common understanding of how they are to be prioritized, because their meanings and uses leave more room for interpretation.

The original proposal by @nnarain Additional levels in DiagnosticStatus · Issue #268 · ros2/common_interfaces · GitHub would for example give ALERT a higher number than most others, and from my arguably non-native language understanding an alert sounds a lot less important than an error.

I like the suggestion by @destogl to use 10, 20, 30 for the currently existing ones (OK, WARN, ERROR). And allow the user to define the values in between.

My two cents from everyday academia/r&d.
TLDR: it might seem tempting to add more taxonomy for the sake of better structure, but it would not help us, and it could clutter the diagnostics UIs depending on the design.

I mostly agree with using diagnostics only outside operations.
Expected problems should be handled in the running framework and not trigger developer errors.
At least originally diagnostics was used to see at a glance whether the system needs to be fixed in some way. That’s best pointed out with the current three main codes and during regular use all components would ideally be OK.

byte INFO=0       # Purely informational
byte OK=1         # Component's diagnostic is in an OK state

Reporting OK for a component already supports additional info keys, as others noted. It’s fine to have OK fields that will never ERROR.
The component is reporting its status and apparently you consider these fields debugging-relevant state over time.
Having INFO as a state leaves it unclear to me whether the component is healthy.
Any logic that decides between INFO and OK in a node might as well trigger an info logging message at that point if it’s really something meant to be informational in a specific situation.

byte NOTICE=2     # Level at which user awareness might be recommended but action is not necessarily required
byte WARN=3       # Begin to bring awareness to users, as there might be an issue

There is no clear difference between these two in our practice.
Most robot developers will ignore both for as long as things work.
They essentially convey “I told you there is an issue, so it’s your fault you ignored it”. And because there is a lot to take in in everyday use, some things have to be ignored :person_shrugging:
Those rare people who strive for an all-OK system will look at both.
And again, similar to INFO, NOTICE usually does not convey a state over time, but something you probably want to see in the log messages.

byte ERROR=4      # An error condition has been detected
byte CRITICAL=5   # Failure is imminent 

I get more people to look at ERROR, but even those are ignored more often than they should because “the setup still works”. Adding more options there makes ERRORs look even smaller.

On the other side when writing a driver, who can decide whether a laser scanner that fails to connect is a CRITICAL failure or just an ERROR?
When the scanner is used as an optional hardware add-on for navigation and I can still use the arms for all the manipulation tasks I want, I don’t want my system to report CRITICAL, because it leads to even more error blindness.

byte ALERT=6      # User attention is required

Again, ALERT does not sound like a status over time.
It’s a point in time when the user needs to be informed, and logging, or an independent signaling mechanism that can decide whether the user is in the room and hears a beep or needs a toast notification on their phone app, seems much more appropriate.

I don’t have any opinion on what to do about E-Stops. We model them as ERROR on some systems.
But after all, it’s (hopefully) the user who triggers them, so it’s a user interaction and thus ideally OK in many contexts.
But it also means the autonomous operation cannot continue and, on some hardware, lifted arms can fall down under gravity, so it’s an error.
I don’t think there is a clear answer where to put it in diagnostics, but obviously we want it logged.

An argument against making it an ALERT is that if the user triggered it, they know, so they don’t need an alert. If it was a critical lower level safety trigger, it should be an error.


This is an important statement you make. Whether something is CRITICAL (or even ERROR) is not always up to the reporter.
In our system we have the Velodyne driver complaining with an ERROR that the Velodyne is missing, which makes sense from the driver’s point of view.

But we know that in some states, the lidar is not even powered and thus this is normal behavior at this system state.

That’s why in the past we already proposed a DowngradeAnalyser.

I don’t have a strong opinion about the reported levels proposed here. But keep in mind that overriding these values in some manner might be just as important.
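The linked proposal isn’t reproduced here, but the core idea can be sketched in a few lines of Python (a simplified illustration of the downgrade logic, not the actual analyzer plugin API, which is C++/pluginlib):

from diagnostic_msgs.msg import DiagnosticStatus

def downgrade(status, expected_off):
    # Downgrade an ERROR to OK when the component is known to be unpowered.
    # 'expected_off' would come from whatever tracks the current system state,
    # e.g. a power-management node; the driver itself keeps reporting ERROR.
    if status.level == DiagnosticStatus.ERROR and status.name in expected_off:
        status.level = DiagnosticStatus.OK
        status.message = "Device off by design in this state: " + status.message
    return status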


This really depends on who the “user” is. Are they the engineer who built the thing? Are they the customer who bought the robot and just wants it to mow the yard? I have certainly gotten to the point where I see “the matrix” when watching our raw container output logs (via a custom logging system), but I’d never expect someone in field operations to need that level of logging. (This is why I don’t really see ALERT being in scope here.)

My 2c having worked in various robotics companies using ROS: this is only useful when it becomes sufficiently standardized and implemented such that open-source tools will strongly depend on the correct definitions. I just don’t see that happening. More options will lead to more messages being incorrectly classified. Especially with @v4hn and @Timple correctly pointing out that severity really depends on the context and can’t always be modified.

in the ROS diagnostics world you can create custom analyzer plugins

this is what typically happens (in cases where diagnostics are actually used heavily). Perhaps we should think through an open-source tool that makes this easier?


We had to make our own diagnostics system (and it does have more levels).
While it would be nice to have more levels, this is not the top priority for fixing ROS 2 diagnostics.

We must have a definitive source of the diagnostic; this is a real-time safety requirement.

Roll-ups cannot manipulate the diagnostics and we cannot have a system where any node is capable of mimicking any other node’s diagnostic.

This is at a minimum a loss-of-property issue, if not a loss-of-life issue.

If we insist on redoing levels then the levels should map to levels of intervention action.

The diagnostics need to support a high-level decision list of something like:

Robot may begin new motion and action

Robot may continue already in-progress motion and action
Robot may continue already in-progress motion and action with degraded capability

Robot must limp-home with degraded capability
Robot must abort motion.

Robot must immediately arrest all motion.

The “roll-up” needs to be capability oriented, and each sub-system needs to be able to make a definitive quality judgment of its operational capability.

The operating agent can then rely on the diagnostics and use them to make go/degrade/no-go and similar decisions.
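A minimal Python sketch of that kind of capability-oriented roll-up (names and values are illustrative only, ordered sequentially for readability rather than with the multi-bit spacing suggested below):

from enum import IntEnum

class Capability(IntEnum):
    UNKNOWN = 0                  # default until a subsystem has actually reported
    MUST_ARREST_MOTION = 1
    MUST_ABORT_MOTION = 2
    MUST_LIMP_HOME = 3
    MAY_CONTINUE_DEGRADED = 4
    MAY_CONTINUE = 5
    MAY_START_NEW_ACTION = 6

def system_capability(subsystem_reports):
    # Each subsystem makes its own definitive judgment of its capability;
    # the executive takes the most restrictive one, and a missing or
    # UNKNOWN report leaves the whole system UNKNOWN.
    if not subsystem_reports or Capability.UNKNOWN in subsystem_reports:
        return Capability.UNKNOWN
    return min(subsystem_reports)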

Default state must be Unknown and that must be zero. This catches a class of bugs where things are zero-initialized and transmitted before they are set by data.

Then you work up from there so that the numeric order is monotonically increasing.

Every enum value must be different from other enum values by more than one bit.

This also leaves space in the stack-up for future additions should they be deemed imperative.

e.g.

Unknown  = 0x0000
Fatal    = 0x001F
Faulted  = 0x003C
Limp     = 0xFFA1
Degraded = 0xFFB2
Warning  = 0xFFC4
Nominal  = 0xFFD8
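A quick way to sanity-check such an encoding against the stated constraints (monotonically increasing, every pair of codes differing in more than one bit), sketched in Python:

from itertools import combinations

CODES = {
    "Unknown": 0x0000, "Fatal": 0x001F, "Faulted": 0x003C,
    "Limp": 0xFFA1, "Degraded": 0xFFB2, "Warning": 0xFFC4, "Nominal": 0xFFD8,
}

# Values must be monotonically increasing in the order listed above.
assert list(CODES.values()) == sorted(CODES.values())

# Every pair of codes must differ in more than one bit (Hamming distance > 1),
# so a single flipped bit can never turn one valid code into another.
for (name_a, a), (name_b, b) in combinations(CODES.items(), 2):
    assert bin(a ^ b).count("1") > 1, (name_a, name_b)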

Just a general question, in line with the thread, that I’ve been curious about for a long time: how many folks out there are using ROS diagnostics as a means of communicating node to node or as part of their safety system? Are the diagnostics part of the logic required to operate your vehicle? Or purely a means to monitor how the vehicle is doing?

I think that this is an important question about what the scope should be. The diagnostics system was not designed to be a safety system, nor a pathway for active automated responses from other nodes. It was designed to provide a human-visible status of components as well as keep track of long-term logging of hardware-related systems.

When looking for more subsystem management and automatic system responses to failures, Managed Nodes and the Node Lifecycle (Design) are much more capable of providing prompt, event-driven automatic responses across the system. For modern drivers I would expect hardware drivers to support both diagnostics and the lifecycle.

But reading through the discussion above, higher-fidelity responses and levels defined by the necessary response of the robot are not something that can effectively be encoded into the low-level driver modules. The most obvious challenge is that an individual hardware sensor cannot know if there’s redundant sensors providing coverage. So the sensor can only report that it’s in error, and the higher-level system determines if that’s critical (it’s the only sensor), or if that’s expected (it’s one of many sensors and some are always expected to be non-operational at any given time).

For the safety types of responses there needs to be an executive that’s listening and aware of the state of the nodes and how failures will affect overall system performance. This can listen to the diagnostics, but it should have higher-priority, lower-latency mechanisms for monitoring anything critical. The slow continuous publishing of diagnostics is not optimized for safety but for introspection and long-term hardware logging.

Also, as flagged by @v4hn, there’s potentially another element of events that should be acknowledged by the user. These most commonly flow through the console logs, and it might make sense to set up aggregators and analysis tools for presenting them to an operator to make sure that they’re aware of an event that has occurred. This is also something that the diagnostics are not optimized for and probably shouldn’t grow in scope to encompass.


Ya, so based on this discussion I think it comes down to the scope of ROS diagnostics and how those in the community are using it.

I agree that the more levels that are added the more room for interpretation there is.
I also agree that any context additional levels provide has to be added by something above the hardware driver layer (this is why I mentioned fault muting earlier, or what @Timple describes as a downgrade analyzer).

My motivation for starting this discussion was because I’m working on adding this additional layer of context in our reporting system and it didn’t feel particularly specific to our use cases.

So if this is out of scope for ROS diagnostics that’s fine.

To answer @Chuck_Claunch’s question: we do use ROS diagnostics to drive some logic on our robots. We have a generic configurable node that looks for transitions in diagnostic levels and triggers services. None of this is for safety. We don’t consider ROS to be part of the safety system (the robots will try to make smart decisions, but ultimately what makes the decisions around safety is a certified safety controller).
Also, our systems are still primarily ROS 1, so maybe the answer to that is what @tfoote suggested: using lifecycle nodes.
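Their node isn’t shown, but the pattern described (ROS 1, reacting to level transitions with service calls, explicitly outside the safety path) might look roughly like this; the topic, service and status names below are placeholders:

#!/usr/bin/env python
# Minimal rospy sketch: watch aggregated diagnostics for level transitions
# and trigger a recovery service. All names are illustrative placeholders.
import rospy
from diagnostic_msgs.msg import DiagnosticArray, DiagnosticStatus
from std_srvs.srv import Trigger

WATCHED = "/robot/sensors/lidar"   # hypothetical aggregated status name
last_level = {}

def on_diagnostics(msg):
    for status in msg.status:
        prev = last_level.get(status.name)
        last_level[status.name] = status.level
        # React only on a transition into ERROR (or worse) for the watched item.
        if (status.name == WATCHED and prev is not None
                and prev < DiagnosticStatus.ERROR <= status.level):
            rospy.wait_for_service("/lidar/reset")
            rospy.ServiceProxy("/lidar/reset", Trigger)()

if __name__ == "__main__":
    rospy.init_node("diagnostic_reactor")
    rospy.Subscriber("/diagnostics_agg", DiagnosticArray, on_diagnostics)
    rospy.spin()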


The most obvious challenge is that an individual hardware sensor cannot know if there’s redundant sensors providing coverage.

That doesn’t matter to that sensor’s diagnostic. The higher level part of the system can combine and reduce their diagnostics.

If that sensor detects a quality issue that renders the data unreliable then its diagnostic report needs to reflect that.

This is also an example of why we have to know, with non-repudiation, which node sent the diagnostic.