Safety-critical WG

At the last safety-critical ROS WG meeting we started discussing a potential simple safety architecture. We were trying to work out how to define a ROS-based robot architecture that can be certified as functionally safe without having to review the whole ROS stack and the underlying OS. The context was a safety architecture for a mobile robot base, but the basic concept can likely be applied to other areas (e.g. arms).

This safety architecture would have:

  1. Designated safety sensor(s) (e.g. a collision avoidance sensor) and a node for it that has the following properties:

    a) It accurately places “origin” time stamps on the safety sensor data that knowably relate to the time in which the data was gathered

    b) The accuracy of those origin time stamps is knowable

  2. A node, or code within a node, that can reliably make a stop/go decision based on the safety sensor and has the following properties:

    a) It knows the capability of the designated safety sensor (e.g. range)

    b) It knows the accuracy with which the origin time stamps are applied

    c) It knows the data being generated from the safety sensor

    d) It knows the current robot speed

    e) It knows enough about the dynamics of the robot to determine how quickly the robot can stop

    f) From the above it can compute with certainty the minimum time for which the robot can keep moving at its current speed and still avoid a collision

    g) It will publish a safety-go command only if that minimum time is greater than the deadman timer plus additional time for inaccuracies

  3. A motor controller that has one property:

    a) It will by default stop within a deadman timer period unless it gets both a safety go command and a command to move
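To make item 2 concrete, here is a minimal Python sketch of the stop/go computation. Everything in it is an assumption for illustration rather than part of the proposal: the names `SafetyInputs`, `safe_travel_time` and `safety_go`, a stationary obstacle, a constant guaranteed deceleration, and the reading that safety-go is published only when the remaining safe travel time exceeds the deadman period plus a margin.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyInputs:
    """Hypothetical inputs matching properties 2a to 2e of the list above."""
    obstacle_distance_m: float  # nearest obstacle reported by the safety sensor
    sensor_range_m: float       # 2a: maximum range the sensor can see
    data_age_s: float           # now minus the origin time stamp
    stamp_error_s: float        # 2b: worst-case error of the origin time stamp
    speed_mps: float            # 2d: current robot speed (>= 0)
    max_decel_mps2: float       # 2e: guaranteed braking deceleration (> 0)


def safe_travel_time(inp: SafetyInputs) -> float:
    """Worst-case time the robot may hold its current speed before braking."""
    if inp.speed_mps <= 0.0:
        return float("inf")  # a stationary robot never reaches the obstacle
    # Never trust a free distance larger than what the sensor can see.
    distance = min(inp.obstacle_distance_m, inp.sensor_range_m)
    # The robot kept moving while the measurement was in transit.
    worst_age = inp.data_age_s + inp.stamp_error_s
    distance -= inp.speed_mps * worst_age
    # Distance needed to stop from the current speed: v^2 / (2 a).
    braking = inp.speed_mps ** 2 / (2.0 * inp.max_decel_mps2)
    return (distance - braking) / inp.speed_mps


def safety_go(inp: SafetyInputs, deadman_s: float, margin_s: float) -> bool:
    """Publish go only if the robot stays safe for a full deadman period."""
    return safe_travel_time(inp) > deadman_s + margin_s
```

With an obstacle at 5 m, a speed of 1 m/s, 1 m/s² of braking and 60 ms of worst-case data age, this gives roughly 4.4 s of safe travel, so a 0.2 s deadman timer with a 0.1 s margin would still get a go. A moving obstacle (the human jumping in front of the robot) would need a closing-speed term that this sketch does not have.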

It seems to me that this architecture depends heavily on accurate clock propagation and probably wouldn’t be robust against junk clock data or clock jumps. So perhaps the safety node would also have to check for those kinds of clock faults.
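As a rough idea of what such a clock check could look like, here is a small Python sketch. The class name `StampChecker` and its thresholds are made up for illustration; it assumes the caller supplies the sensor's origin stamp and its own current time in seconds on the same time base.

```python
from typing import Optional


class StampChecker:
    """Hypothetical plausibility check for origin time stamps."""

    def __init__(self, max_age_s: float, max_future_s: float) -> None:
        self.max_age_s = max_age_s        # older data is treated as stale
        self.max_future_s = max_future_s  # tolerated clock skew into the future
        self._last_stamp_s: Optional[float] = None

    def accept(self, stamp_s: float, now_s: float) -> bool:
        """Return True only if the stamp is fresh, not from the future,
        and strictly newer than the last accepted stamp."""
        age = now_s - stamp_s
        if age > self.max_age_s:
            return False  # stale data, or the local clock jumped forward
        if age < -self.max_future_s:
            return False  # stamp from the future, or the local clock jumped back
        if self._last_stamp_s is not None and stamp_s <= self._last_stamp_s:
            return False  # repeated or backwards stamp
        self._last_stamp_s = stamp_s
        return True
```

Every rejection simply means no safety-go is produced for that sample, so a clock fault degrades to a stop rather than to a wrong go. It does not detect a clock that drifts slowly but consistently on both sides.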

However, the beauty of this architecture is that it makes no assumption that anything else works. It does not assume your whole system is real time, it doesn’t even assume that the whole system works. You don’t need to certify the whole ROS codebase or the underlying operating system - you only need to certify items 1-3, and they are relatively simple.

It may even be possible to combine 2 and 3 together as a single monolithic node that accepts safety sensor data, computes the time stamp on that data and sees whether it is safe to continue.

I have a couple of questions

  1. What is the nature of this mythical safety sensor?

  2. How are you going to practically distinguish between a wall that you might drive alongside and a human who plans to jump in front of the robot at the last minute?

  3. What does everyone else think?

To me it’s not clear which applications, and which safety integrity levels (SILs) they usually imply, are to be addressed with this architecture. Are you considering homogeneous or heterogeneous redundancy (redundant nodes, redundant topics)? Will redundancy be considered w.r.t. hardware as well? Probably interesting for further design decisions: Functional Safety Design Patterns.

I think the use case needs to be defined first and the hazards derived from it.
The safety system for a small low speed mobile delivery robot in a warehouse will be different from a high speed fully autonomous car driving on highways and residential streets.

There’s overlap between safety and anything. This is one reason why it’s hard to find something specifically for this group to contribute. I don’t see the real-time people conducting an STPA any time soon, though.

Dear all,

On behalf of the organizing committee, I am excited to announce that the Open Source Software in Safety-Critical Systems Summit will take place on October 31, 2019 in Lyon, France.

This conference is the second summit in the area of open-source software and safety-critical systems, being a further evolution of last year’s Linux in Safety-Critical Systems Summit. In addition to Linux, this year we would like to include presentations from activities and experts around other open-source projects that aim towards use in safety-critical systems.

The summit will take place alongside Open Source Summit + Embedded Linux Conference Europe 2019 in Lyon, France. It is scheduled the day after the main conference, Thursday, October 31st, 2019, from 8:00 to 17:00 at the conference venue. If you are planning to attend Open Source Summit + Embedded Linux Conference Europe 2019 in Lyon, France, please extend your travel by one day to be in Lyon on Thursday, 31st to join others in-person to present ideas and discuss how to achieve safety of current and future systems that use open-source software.

Please share the conference and the CFP with your networks and experts that have interest in this topic.

The ROS Safety-critical working group can certainly make good contributions to this summit.

We look forward to seeing you in Lyon!

Best regards,

Lukas

Alas, you won’t see me there because I will be at ROSCon in Macau.

Do you plan to have recordings of the talks made available afterwards?

@ThiloZimmermann @Levi-Armstrong and @gavanderhoorn this should be on our radar, as it is coming up in conversations with stakeholders. I understand you may all be at ROSCon in Macau as well; however, if another team member from Fraunhofer or a partner team can attend, it may be beneficial.

I will be at ROSCon in Macao as well. But I will check which of my colleagues could be available for this.
No doubt, the topic is hot 🌶

Unfortunately, I don’t believe that this is true.

If you are implementing this as ROS nodes, you will need to either certify ROS, use a ROS that has already been certified and is certified for your type of application, or provide a justification for why you can trust ROS to do what you think it will. The third one is really, really hard for something as complex as ROS.

The same goes for the OS underneath. That’s why QNX and VxWorks are worth so much.

You can’t assume that stopping is the correct action to take in every situation, so the proposed approach is not necessarily usable. You need to analyse the hazards and risks in your application and understand if stopping is a correct safety feature, and if it is, understand how the robot needs to stop.

Your proposed architecture does not account for things like the motor controller failing, or a mis-fired safety-go signal, or the communications bus sending garbage data that the motor controller interprets as a safety-go signal.

Things like this are why it is really hard to design a generally-applicable safety architecture. It is probably worth trying, but it is not a simple goal.
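As one illustration of the garbage-data concern, here is a hedged Python sketch of a safety-go frame protected by a sequence counter and a CRC, so that the motor controller can reject corrupted or replayed frames. The frame layout and the names `encode_go` and `decode_go` are invented for this example; they are not part of ROS or of the proposed architecture. A CRC-32 only catches random corruption, not a node that faithfully sends a wrong decision, and counter wrap-around is ignored here.

```python
import struct
import zlib
from typing import Optional, Tuple

# Hypothetical layout: uint32 sequence counter, float64 origin stamp, uint8 go flag.
_PAYLOAD = struct.Struct("<IdB")
_CRC = struct.Struct("<I")


def encode_go(seq: int, stamp_s: float, go: bool) -> bytes:
    """Serialise a safety-go frame and append a CRC-32 of the payload."""
    payload = _PAYLOAD.pack(seq, stamp_s, 1 if go else 0)
    return payload + _CRC.pack(zlib.crc32(payload))


def decode_go(frame: bytes,
              last_seq: Optional[int]) -> Optional[Tuple[int, float, bool]]:
    """Return (seq, stamp, go) for a valid frame, or None if it must be ignored."""
    if len(frame) != _PAYLOAD.size + _CRC.size:
        return None  # wrong length: not one of our frames
    payload, crc_bytes = frame[:_PAYLOAD.size], frame[_PAYLOAD.size:]
    (crc,) = _CRC.unpack(crc_bytes)
    if zlib.crc32(payload) != crc:
        return None  # corrupted in transit
    seq, stamp_s, flag = _PAYLOAD.unpack(payload)
    if flag not in (0, 1):
        return None  # flag value outside the defined set
    if last_seq is not None and seq <= last_seq:
        return None  # repeated or out-of-order frame
    return seq, stamp_s, flag == 1
```

A receiver that treats every `None` as "no safety-go received" falls back to the deadman behaviour, which covers the bus-garbage case but none of the other failure modes listed above, such as the motor controller itself failing.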

Dave wrote:
However, the beauty of this architecture is that it makes no assumption that anything else works. It does not assume your whole system is real time, it doesn’t even assume that the whole system works. You don’t need to certify the whole ROS codebase or the underlying operating system - you only need to certify items 1-3, and they are relatively simple.

I believe this is true. The system he described to me was fail-silent, i.e., if there is no signal, the machine stops moving immediately, and had no real-time requirements that safety relies on, i.e., if a signal comes late, it does not send a signal and stops moving.
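The stop-on-missing-signal behaviour described here could be sketched roughly as follows in Python. The class `DeadmanController` and its methods are hypothetical, and time is passed in by the caller (assumed to come from a monotonic clock) so that the logic does not depend on wall-clock time.

```python
from typing import Optional


class DeadmanController:
    """Hypothetical motor-controller gate: move only while both a safety-go
    and a move command are younger than the deadman period."""

    def __init__(self, deadman_s: float) -> None:
        self.deadman_s = deadman_s
        self._last_go_s: Optional[float] = None
        self._last_move_s: Optional[float] = None
        self._velocity = 0.0

    def on_safety_go(self, now_s: float) -> None:
        self._last_go_s = now_s

    def on_move_command(self, velocity: float, now_s: float) -> None:
        self._velocity = velocity
        self._last_move_s = now_s

    def _fresh(self, t: Optional[float], now_s: float) -> bool:
        # A signal that never arrived, or arrived too long ago, is not fresh.
        return t is not None and now_s - t <= self.deadman_s

    def output(self, now_s: float) -> float:
        """Velocity to apply now; 0.0 (stop) is the default."""
        if self._fresh(self._last_go_s, now_s) and self._fresh(self._last_move_s, now_s):
            return self._velocity
        return 0.0
```

A late signal is handled the same way as a missing one: once the deadman period has elapsed the output is zero, regardless of what arrives afterwards until a new pair of fresh signals is received.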

If you are implementing this as ROS nodes, you will need to either certify ROS, use a ROS that has already been certified and is certified for your type of application, or provide a justification for why you can trust ROS to do what you think it will. The third one is really, really hard for something as complex as ROS.

Of course, you need to certify ROS for your type of application, but it is clear that in this case the synchronisation and accuracy of the time attached to messages is the crucial information, which ROS should not modify. All other properties are not relevant for a safety consideration. Dave might just need to describe the system in a bit more detail, as he described it to me.

For sure, a proper concept and architecture needs to be described and a safety analysis needs to be done, i.e., analyse the hazards and risks in your application and understand if stopping is a correct safety feature, and if it is, understand how the robot needs to stop. In this case, we came to the conclusion that if a safety-go signal with correct time is missing, the motor reacts without causing any harm in any situation, i.e., small speed, known environments.

This whole consideration depends on the system and on the environment, not on ROS or the OS, though. So, it is unlikely that ROS (as middleware), QNX, VxWorks, Linux or any other operating system has already done a system design for that purpose.

That system description and system design should be the starting point for a safety-critical WG. Dave described a system (which I assume fits his use case) and hence safety considerations (or certification) of ROS is limited to a very specific property. That, i.e., naming the very specific property and checking only that with great rigor, is what certification cares about.

It isn’t fail-silent, because it is not guaranteed to provide correct service. It’s fail-safe to one particular type of failure (control commands failing to be delivered to the motor controller in a timely fashion), but there is no guarantee in there that any commands that are delivered will be correct without proving that the node in 2 will do the things it claims, which involves proving a lot about ROS itself.

If you were to implement that architecture without ROS on an embedded microprocessor with no OS, I would say you have a fairly traditional safety monitor and that the implementation would be easy to certify. But because you want to implement them on ROS, you now have to also somehow prove that ROS will support your node in doing the things it claims it will do. There are many techniques that can be used to do it, so it’s not an unsolvable problem, but it is a large quantity of work due to the complexity of ROS.

it is clear in this case, the synchronisation and accuracy of time attached to messages is the crucial information, which ROS should not modify

Firstly, you need to demonstrate that it will not modify them to achieve certification.

Secondly, the control commands themselves are also relevant because you are not just concerned with stopping when a failure occurs, but also that the robot performs correctly when there are no failures, and that other failures that still result in commands being delivered to the motor controller do not lead to incorrect behaviour.

That system description and system design should be the starting point for a safety-critical WG.

I’m not arguing against this; I’ve argued for this previously.

Dave described a system (which I assume fits his use case) and hence safety considerations (or certification) of ROS is limited to a very specific property. That, i.e., naming the very specific property and checking only that with great rigor, is what certification cares about.

What I’m arguing against is assuming that just stopping if no commands arrive is sufficient to be safe and/or certifiable. Certification cares about whether you have proven to a sufficient degree of confidence that:

  • your system’s designed behaviour is safe, and
  • your system will behave as designed.

Yes, a certification agency will care about whether the properties you’ve chosen have been shown to be true, but they will also want to know why those properties are correct and sufficient to be safe.

I think we largely agree, and I share your opinion that we need to determine precisely how the various elements may impact the overall system and which arguments we really have to claim the absence of such impact.

I hope that Dave can describe in more detail the assumed system architecture and the intended function of the elements (at best also in some structured language). Then, we can apply a safety analysis method (FTA, FMEA, HAZOP, STPA) to systematically derive the relevant aspects that require deeper investigation.

If one wants to use ROS2 in applications with functional safety requirements, there may be a need for e.g. detailed design documentation and other things more related to processes than to design and/or code. The same could be required for the certified ROS2 in use as well. Has this topic already been discussed?

It has, briefly. Mainly in the context of “Apex.AI is doing this already.”

I have the feeling that this approach pushes everything onto the “safety sensor”.

I have a 3D LIDAR based sensor. It recognizes objects in the point clouds and draws danger areas relative to those objects (within ROS nodes). How can I make a safety sensor from this? The problem remains (if I understand this summary of the last SC ROS WG correctly)…

Hello everyone! Thanks to @tfoote I got a link to your working group. I work at NXP Semiconductors on safety of autonomous driving. In particular, we’re investigating options to leverage middleware and hypervisors to build distributed safety mechanisms, involving resource-constrained lock-step cores on automotive-grade SoCs. Hopefully, I can share our (semiconductor) perspective and contribute to the discussions.

I agree with @fkromer - redundancy is key for building safe systems (using less reliable components). DDS has Quality-of-Service (QoS) policies that support redundancy for fault tolerance, such as the OWNERSHIP QosPolicy. On the other hand, as far as I know, ROS2 QoS policies do not include OWNERSHIP STRENGTH, LIVELINESS and DEADLINE.
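For readers less familiar with how exclusive OWNERSHIP gives you redundancy, here is a plain-Python sketch of the arbitration idea: among the writers that are still alive, the reader takes data only from the one with the highest strength. This is an illustration of the concept only, not the DDS or ROS2 API; `Writer`, `select_owner` and the lease handling are made-up names and simplifications.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Writer:
    """Hypothetical stand-in for a redundant publisher of the same topic."""
    name: str
    strength: int        # plays the role of OWNERSHIP STRENGTH
    last_alive_s: float  # time of the last liveliness assertion


def select_owner(writers: Iterable[Writer], now_s: float,
                 lease_s: float) -> Optional[Writer]:
    """Pick the strongest writer whose liveliness lease has not expired."""
    alive = [w for w in writers if now_s - w.last_alive_s <= lease_s]
    # Ties are not handled specially here; max() keeps the first strongest writer.
    return max(alive, key=lambda w: w.strength, default=None)
```

For example, with a primary of strength 10 and a backup of strength 5, the reader follows the primary until its lease expires and then switches to the backup without any application-level failover code. That is why LIVELINESS and DEADLINE matter alongside OWNERSHIP STRENGTH.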

Would it make sense to discuss a proposal of additional ROS2 QoS policies for redundancy in this group? Are there any other mechanisms in ROS2, which need improvements to facilitate redundancy implementations, such as node lifecycle management or diagnostics?

I think having a simple and effective support for redundancy in ROS2 can boost its adoption in safety-critical systems.

For what it is worth, both LIVELINESS and DEADLINE were added to ROS 2 Dashing as part of “Adding deadline, liveliness and other QoS to ROS 2” (issue #572 in ros2/rclcpp on GitHub). That ticket is still open because OWNERSHIP hasn’t been added yet, but that would be a good place to collaborate with others looking to get the feature in.

How do you see ROS2 running on safety cores in automotive SoCs (System-on-a-Chip)? If it is a well-recognized issue, is it interesting for the safety-critical working group to formulate an appealing proposal for future ROS to address it?

Automotive SoCs, such as NXP S32V234 or Renesas R-Car H3, integrate separate compute and safety subsystems on a single chip. The compute subsystem typically features Cortex-A cores, whereas the safety subsystem includes Cortex-M or Cortex-R cores. The safety cores lack features (e.g. MMUs) and speed, so they have a hard time running full middleware stacks such as ROS2. On the other hand, they are better suited for real-time (e.g. low interrupt latencies) and safety-critical processing (e.g. hardware lock-step).

In autonomous systems, it is important that the safety mechanism participates in the (ROS) middleware network to perform application-level health monitoring, as well as to react to system faults (ISO 26262) or hazardous situations in the absence of faults (ISO/PAS 21448). In this context, we’ve experimented with DDS-XRCE on safety cores, integrating them this way in a ROS and Cyber RT experimental setup. However, the DDS-XRCE agent, which has to run on a powerful (and less reliable) core, introduces latency and acts as a single point of failure. The latter is particularly troublesome from the safety concept perspective. After all, as far as I know, DDS-XRCE was conceived for resource-constrained IoT devices, which are more volatile than the reliable cloud, whereas our automotive resource-constrained cores are more reliable than their powerful counterparts.

Recently I’ve discovered RTI’s Connext DDS Micro, https://community.rti.com/static/documentation/connext-micro/3.0.0/doc/html/index.html, that seems to need no agent while running on a resource-constrained device, but I’ve no experience with it. Can ROS2 run on top of the Micro or Cert editions instead of the Professional edition?

P.S. I’m not sure if this topic better fits the embedded or safety working group, but the origin of the issue is in safety, I believe.

Yes, I think it’s fine if you want to contribute that.

Yes, this is an unfortunate and significant flaw in the DDS-XRCE design, in my opinion. As an alternative, you should consider some low-resource implementation of DDS, of which there are a growing number. FastRTPS can probably be adapted with some tuning. CycloneDDS may also be a contender, although I need to look more into what its capabilities are.

I don’t think there is an rmw for those versions of DDS, but I have heard rumours that RTI is preparing one for their new version of Connext DDS Micro.
