I agree.

Right. Reliability (like the other quality attributes) is business driven, which means money driven. When unreliable software gets very expensive, as in the finance web domain, people start implementing things to guard against it.

If no one complains, nothing will change. There is a well-known saying in agile software development: “let everyone feel the pain”… in the end, pain in terms of money.

I think we could prevent a lot of future work if we

  • consider some concepts now,
  • think about what they depend on (from a design/implementation point of view), and
  • address those dependencies (in design/implementation) as soon as possible, where the effort is reasonable.

Relating to this and the rest of your post: a robot can be a very opaque and complex system, and a complex ROS-based robot may have many nodes with a lot of data flying around. Making it easy to see where an error originated could go a long way towards helping integrators find nodes with quality problems so those can be reported. This means making it more obvious why a node process died, for example. Was it because of a logic error in the node, or because another node published some bad data? (In the latter case, both nodes need to be fixed.)

We already discussed the idea of having a “chaos node” that sends random messages (with content conforming to the message format) to existing topics and services, which helps discover “unsafe nodes” during development.
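For concreteness, here is a minimal sketch of what such a chaos node could look like in rospy. Everything specific here is an assumption for illustration: the target topic (`/cmd_vel`) and message type (`Twist`) are arbitrary examples, and the fuzzer only handles scalar fields and nested messages, leaving arrays at their defaults.

```python
#!/usr/bin/env python
# Sketch only: a "chaos node" that publishes type-valid random messages.
import random
import rospy
from geometry_msgs.msg import Twist  # example target type (an assumption)

def fuzz(msg):
    """Fill a message's scalar fields with random, type-valid values."""
    for slot, slot_type in zip(msg.__slots__, msg._slot_types):
        value = getattr(msg, slot)
        if slot_type.endswith(']'):
            continue  # arrays are left at their defaults in this sketch
        elif slot_type in ('float32', 'float64'):
            setattr(msg, slot, random.uniform(-1e6, 1e6))
        elif slot_type.startswith(('int', 'uint')):
            setattr(msg, slot, random.randint(0, 127))
        elif slot_type == 'string':
            setattr(msg, slot, ''.join(chr(random.randint(32, 126))
                                       for _ in range(random.randint(0, 32))))
        elif slot_type == 'bool':
            setattr(msg, slot, random.random() < 0.5)
        elif hasattr(value, '_slot_types'):  # nested message: recurse
            fuzz(value)
    return msg

if __name__ == '__main__':
    rospy.init_node('chaos_node')
    # The target topic is hard-coded here; a real tool would discover
    # topics and their types from the master, like rostopic does.
    pub = rospy.Publisher('/cmd_vel', Twist, queue_size=1)
    rate = rospy.Rate(rospy.get_param('~rate', 10.0))
    while not rospy.is_shutdown():
        pub.publish(fuzz(Twist()))
        rate.sleep()
```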

But here I notice your focus on post-mortem analysis of “systems in production”. Maybe we could parametrize the “chaos node” so it can run in production, just as server farms run some potentially destructive tests in production, but only when demand is low?

Post-mortem also means the system is running in production but crashes, and we want to know what happened. I am not aware of any tools for this purpose yet… Each node can use the tools of the programming language it was written in, but as for the communication between nodes, maybe we should have a library bagging each message and keeping it for a certain amount of time? Some instrumentation that node writers could add to their code, so that when a node crashes, we can get the log of all messages received (and sent) by that node?
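To make that concrete, something like the sketch below could work as opt-in instrumentation. The `FlightRecorder` name and dump path are invented for illustration; this is not an existing ROS library. Note also that an `atexit` hook only fires on a clean interpreter shutdown, which is exactly why the seg-fault case discussed further down needs different machinery.

```python
# Sketch of a per-node "flight recorder": wrap publishers/subscribers
# so the last N messages per topic are kept in memory and dumped on
# (clean) shutdown. Hypothetical names throughout.
import atexit
import collections
import rospy

class FlightRecorder(object):
    def __init__(self, depth=100, dump_path='/tmp/flight_recorder.log'):
        self._buffers = collections.defaultdict(
            lambda: collections.deque(maxlen=depth))
        self._dump_path = dump_path
        atexit.register(self._dump)  # does NOT fire on a seg-fault

    def subscriber(self, topic, data_class, callback, **kwargs):
        """Like rospy.Subscriber, but records each message first."""
        def recording_callback(msg):
            self._buffers['sub:' + topic].append((rospy.get_time(), msg))
            callback(msg)
        return rospy.Subscriber(topic, data_class, recording_callback, **kwargs)

    def publisher(self, topic, data_class, **kwargs):
        """Like rospy.Publisher, but records each published message."""
        pub = rospy.Publisher(topic, data_class, **kwargs)
        recorder = self

        class RecordingPublisher(object):
            def publish(self, msg):
                recorder._buffers['pub:' + topic].append(
                    (rospy.get_time(), msg))
                pub.publish(msg)
        return RecordingPublisher()

    def _dump(self):
        # Write the buffered traffic out so it survives the process.
        with open(self._dump_path, 'w') as f:
            for key, buf in self._buffers.items():
                for stamp, msg in buf:
                    f.write('%s %.6f %s\n' % (key, stamp, msg))
```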

Is anyone aware of such a lib/module, or maybe it is just an extra feature we could add to the existing core ROS libs?

We could use rosbag for this purpose, except that instead of storing the topics in a file, it would store them in a buffer (which might be a file) of fixed size/time duration.
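If I remember correctly, `rosbag record` already gets close to this on disk with its `--split`/`--duration`/`--max-splits` options, which keep only the most recent chunks. An in-memory variant could look roughly like the following; the topic list and message types are assumptions for the example, and a general version would resolve types dynamically the way `rosbag record` does.

```python
#!/usr/bin/env python
# Sketch: keep the last 30 seconds of selected topics in memory and
# only write a real bag file when asked (here, on shutdown).
import collections
import rosbag
import rospy
from std_msgs.msg import String        # assumed topic types
from sensor_msgs.msg import LaserScan

WINDOW = rospy.Duration(30.0)  # retention window
buffer = collections.deque()   # (topic, msg, stamp) tuples

def make_callback(topic):
    def callback(msg):
        now = rospy.Time.now()
        buffer.append((topic, msg, now))
        # Drop anything older than the retention window.
        while buffer and now - buffer[0][2] > WINDOW:
            buffer.popleft()
    return callback

def dump(path='/tmp/post_mortem.bag'):
    with rosbag.Bag(path, 'w') as bag:
        for topic, msg, stamp in buffer:
            bag.write(topic, msg, stamp)

if __name__ == '__main__':
    rospy.init_node('black_box_recorder')
    rospy.Subscriber('/chatter', String, make_callback('/chatter'))
    rospy.Subscriber('/scan', LaserScan, make_callback('/scan'))
    rospy.on_shutdown(dump)  # persist the buffer when the node exits
    rospy.spin()
```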

In the worst-case scenario of a seg-fault, we can’t get all the messages that the affected node published/received, but we can certainly get the messages in the publisher queue from the memory dump (if we know where to look). Getting subscriber messages might also be possible in a similar manner (after changes to the current subscriber code). The bigger issue would be finding the relevant memory sections in the dump. It would require some tooling; off the top of my head, something like a gdb plugin, similar to CfViz.

Maybe we could run the chaos node and the modified rosbag-style logging node together. We could check the list of nodes at a user-dictated frequency (similar to rosnode list) and permanently store only the messages on topics subscribed to or published by a node that goes off the radar without announcing its shutdown on a certain topic (feasible because ROS 2 brings in the concept of managed nodes).
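The watchdog half of that could be as simple as polling the master, as in the sketch below. What to do when a node vanishes (persist the buffered topics it touched) is left as a callback, and the “certain topic” for managed-node shutdown announcements is not modelled here.

```python
#!/usr/bin/env python
# Sketch: poll the node list and flag nodes that disappear.
import rosnode
import rospy

def watch(on_vanished, poll_hz=1.0):
    known = set(rosnode.get_node_names())
    rate = rospy.Rate(poll_hz)
    while not rospy.is_shutdown():
        current = set(rosnode.get_node_names())
        for node in known - current:
            on_vanished(node)  # e.g. persist buffered messages for its topics
        known = current
        rate.sleep()

if __name__ == '__main__':
    rospy.init_node('node_watchdog')
    watch(lambda n: rospy.logwarn('node vanished: %s', n),
          poll_hz=rospy.get_param('~poll_hz', 1.0))
```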

Not just in production, but in CI as well. For example, look at the CI system that Fetch Robotics presented at ROSCon 2016. They record all the data they can on the robot so they can analyse it if an error occurs during a test run. Widely-available tools to make analysis of this data straightforward, such that it rapidly directs the developer back from the point of the error through the causality chain, would be immensely useful. A lot of this is about instrumentation, but also a lot of it is about tools to analyse the data that comes out of that instrumentation.