
ROS 2 and Real-time

Ingo, I agree with you that this is a significant effort - however, there are many, many use cases. If you are relying on low-cost camera sensors (either 2D or 3D) for localization and obstacle avoidance, you are most likely going to be piping that data through the CPU. If you want to go faster than a crawl, you'd better have some level of determinism as to when that data is going to be processed and a stop/continue decision is made. Certain types of camera-based navigation just don't work unless you have deterministic timing, and that deterministic timing needs to hold between the cameras and things like the IMU.

I have customers today (for example, in warehousing) who want the robot to go as fast as possible (to increase throughput) but also want to spend as little on the robot as possible - hence lower-cost sensors. Now, I can easily build a fast, low-cost robot, but unless I am using systems that are outside of ROS, it's hard for me to ensure it is safe.

Even if I could get part of the system to have deterministic timing, that would be a huge boon. For example, if the safety sensors could pass messages directly to the motor controller over a deterministic data fabric (say, AVB Ethernet), so that the motor controller knows it has to either slow down or stop, that would already help a lot.

I might start there - with that simple subset: a safety sensor with an RTOS-based uC and a motor controller with a similar RTOS-based uC, the two passing DDS messages with deterministic characteristics.

If we can get that to work then we can move on to the CPU oriented applications of camera based navigation.

1 Like

David, I’m all sold on real-time. It’s essential for many applications. I hope my comments above are not implying otherwise. Real-time is essential for predictability, which is what I think you’re aiming for.

I’m even more sold on determinism. In fact, I’m usually credited with putting it prominently on the ROS map through a talk at ROSCon 2017. That’s not the same as real-time, however.

When I say that those are big efforts, I’m not saying we don’t have to do this. I’m mentioning it because I care about the how.

And in particular, I think that the how depends a lot on the what for. And Dejan didn’t specify that. So, all I’m asking is that people seriously think about the what for before embarking on this journey.

At the very least, these thoughts will give us some requirements and some metrics. And then we can think about how to achieve that in the easiest way.

It will also give us some common ground to speak about.

For me personally, I care about small consumer robots. So, my answers to the stack Dejan posted are:

  1. We use micro-controllers and they are real-time capable and usually single core, some multi-core, usually homogeneous.
  2. We use a POSIX RTOS, NuttX
  3. We chose hardware that’s supported by 2. Still a pain though.
  4. Not necessary
  5. We use Micro-XRCE-DDS

  6. 6.1) don’t care
    6.2) got some ideas, interest to collaborate
    6.3) got some ideas, interest to collaborate
    6.4) unclear…
    6.5) don’t care
    6.6) HERE’S THE BIGGIE
    6.7) don’t care
    6.8) got it
  7. don’t care
  8. working on it, using SPA approach
  9. proposing system modes approach
  10. don’t care
  11. maybe
  12. we use a HIL approach

Now, if you ask Dejan, I’m pretty sure his answers will be very different. That’s because he has a very different application (autonomous driving). I’m not sure what Victor will say, but he’s working on manipulation, which could again be quite different.

So, from my perspective, we’d like to collaborate on some of the points under bullet point 6, and that’s also the core thing for ROS 2. For the rest, we are very likely to have a radically different approach.

Therefore, please, before embarking on a “full stack” approach, get your use cases clear and then maybe people will know what they’re getting into and what the goals are.

3 Likes

Great! Now we are getting some good discussion. Ingo, yes, we’ve dived into actions and solutions before getting clear on the problem statement. We’ve also generated a list of actions that mixes development and process and could be more clearly MECE. We are missing chunks - deterministic message passing requires known allocations of bandwidth, which is available with standards like AVB but not, so far, included on the list - and we also need to talk a little bit about hardware. How about if we reformulate the discussion as follows, for everyone to throw rocks at:

Ultimate goal: To enable universal robotic service

Current Situation: ROS robots cannot operate safely in many domains due to a lack of predictability and determinism, which means sensors can’t be relied on to prevent human injury.

Desired Situation: ROS2 robots can guarantee safety in environments with unprotected humans by using sensors and by passing data from those sensors through the ROS2 messaging infrastructure.

Problem Statement: How can we create deterministic responses to sensor inputs in ROS2?

Problem Breakdown

Hardware

  • CPU
  • uC (I assume almost everything attaches to a standard uC and is deterministic up to the uC) *
  • Communications infrastructure
    • High speed (e.g. AVB Ethernet) *
    • Low speed (e.g. SLIP)

Software

  • uC RTOS (e.g. NuttX) *
    • modifications???
  • CPU RTOS (e.g. QNX derivative)
    • modifications (e.g. the PREEMPT_RT patch for Linux, etc.)
    • drivers for deterministic communications (e.g. bandwidth allocation on AVB)
  • Middleware (e.g. DDS and Micro-XRCE-DDS)
  • Libraries (rmw, rcl, rclcpp)
    • Cleanup for safe implementation
      • introduce safe data types (bounded, check type integrity)
      • perform memory audit (remove unneeded memory allocations)
      • split memory allocation in init and runtime phases, avoid memory fragmentation
      • implement real-time safe log output handler (no logging to console or file)
      • remove all blocking calls (or replace with timed calls, e.g. mutex vs timed_mutex)
    • Cleanup for desirability
      • convert ros2 launch to C++
    • Implement real-time pub/sub (either using Waitset or modified Callback/Executor) *
      • Real-time pub/sub will need to request bandwidth allocation from shared comms infrastructure (e.g. AVB ethernet)
      • Define message length standards
      • Other
  • Communications infrastructure firmware *
  • Services
    • Global error handling (history of failures, core dumps, fail-safe mechanism, …)
    • Real-time safety for higher-level concepts (e.g. services, parameters, actions)

Process

I’d propose to do the bold italic items (marked * above) first, with the initial target state being a sensor (say, a sonar) passing a message to a motor controller for E-Stop.

As the author of the Ada client library, I’m interested in this topic. Not sure I can contribute much at this time but I will in any case try to keep the Ada library in sync with developments in this area.

Yes, I’d like to join the working group and contribute.

Please count me in for teleconferences!

@LanderU and @abilbaotm, unless I’m traveling, I’ll be in the meeting, but would it be possible for you to participate as well and give a short update on the RT build farm we’re building? I think it’s a great opportunity to get input.

1 Like

I guess some of the people interested in ROS 2 and real-time are aware of the future DDS/TSN mapping, but I did not see any material related to it linked in the discussion, so I think it is worth adding the links here:

I’d also like to attend the working group meetings.

@vmayoral @iluetkeb @davecrawley et al thank you very much for your feedback and also apologies for not replying earlier.

I am with you and I agree that we need a use case. Not listing an item to decide on the use case was an oversight on my side. However, I gathered the following use cases from your replies:

  1. consumer robots (Ingo)
  2. warehouse logistics robots (David)
  3. mobile manipulation robots (Victor)
  4. autonomous driving robots/cars (Dejan, Geoff)

If you want to plus-one any of the above items or add another use case, please let me know.

Otherwise, it is clear that the above 4 use cases alone are vastly different and will have different requirements, different HW, different middleware, and probably also different control and data flows (which will result in different node architectures).

That said, I guess we have 2 options to start from here:

  1. select one of the use cases above and probably lose the interest of people with other use cases.
  2. focus on somewhat generic parts of ROS 2 that will help independently of the selected use case.
    My reading of your comments is that these parts are the following ones:
    1. create rmw layers for static and real-time middleware (RTI Connext Micro, Micro-XRCE-DDS)
    2. perform memory audit in rmw, rcl and rclcpp (remove unneeded memory allocations)
    3. split memory allocation in init and runtime phases, avoid memory fragmentation
    4. remove all blocking calls (or replace with timed calls, e.g. mutex vs timed_mutex )
    5. implement real-time pub/sub (either using Waitset or modified Callback/Executor)
    6. integrate tools for static and dynamic code analysis (PCLint, LDRA, Silexica, LTTng)
    7. Create node architecture for deterministic execution (policy for message aggregation, nodes cohesion, parallelization, local error handling …)
    8. Create a design for global error handling (history of failures, core dumps, fail-safe mechanism, …)
    9. Create CI for RT testing (e.g. https://github.com/ros2/ros2/issues/607#issuecomment-460319513 )

I am leaning towards the second option.

I’d like to invite you to a meeting next week to decide on the above and to kick off the work. Could you guys meet on

Note that we have an interest across the globe, so getting a meeting friendly time in every zone will be impossible.

1 Like

That starting time is very Europe-friendly, so yes, it works for me 🙂

How much time do you expect the meeting would take? More than 90 minutes?

That time works for me.

Thanks for organizing, @Dejan_Pangercic - that works very well for us. Looking forward to it!

Thank you @vmayoral for these interesting links.

@vmayoral do you have a written-down definition of what soft real-time and hard real-time are? At Apex.AI we have a very informal one for the latter:

In general, a hard real-time system can be defined as follows: the entire system must be deterministic, and if a deadline or a data sample is missed, the consequence is catastrophic.

@iluetkeb I was asked by the ROS 2 TSC to lead the ROS 2 RT WG. While I currently work in autonomous driving and have insight into that domain, my first priority in this role is to make sure that ROS 2 is a success. Hence, I will also live with the less comprehensive approach if that is what the rest of you think is better to do.

@davecrawley it seems that here yours and Ingo’s goal somewhat overlap.

@iluetkeb is it that you are unsure whether you are interested in this, or do you need more clarification from my side?

@iluetkeb can you elaborate a bit more what SPA is?

@iluetkeb can you elaborate a bit more what system modes approach is?

@davecrawley the problem here is that if the sensor does not speak DDS (and hence we cannot use DDS’ data model flow), then we are actually solving a use-case-specific problem (e.g., is this sensor connected over AVB Ethernet or CAN, which RTOS are we running, do we have a regular network stack or TSN, …). I am not opposed to jumping on this, but let’s first decide whether we do use-case-specific or generic work.

Otherwise thanks for re-structuring this.

@iluetkeb I want to keep it at 60 mins.

@Dejan_Pangercic in our view, when speaking about hard real-time systems, missing a deadline implies a system failure (which can often lead to a catastrophic consequence, though not always necessarily). We discuss this topic at https://arxiv.org/pdf/1809.02595.pdf.
Summarized, our view is the following:

Real-time systems can be classified depending on how critical it is to meet the corresponding timing constraints. For hard real-time systems, missing a deadline is considered a system failure. Examples of hard real-time systems are anti-lock brakes or aircraft control systems. Firm real-time systems, on the other hand, are more relaxed: information or a computation delivered after a missed deadline is considered invalid, but it does not necessarily lead to system failure. In this case, missing deadlines degrades the performance of the system; in other words, the system can tolerate a certain number of missed deadlines before failing. Examples of firm real-time systems include most professional and industrial robot control systems, such as the control loops of collaborative robot arms, aerial robot autopilots, or most mobile robots, including self-driving vehicles. Finally, in the case of soft real-time, results remain useful even if delivered after a missed deadline. This implies that soft real-time systems do not necessarily fail due to missed deadlines; instead, misses degrade the usefulness of the real-time task in execution. Examples of soft real-time systems are telepresence robots of any kind (audio, video, etc.).

Beyond this, there’s also a nice description of the differences between hard, firm and soft at http://design.ros2.org/articles/realtime_background.html

Hard real-time software systems have a set of strict deadlines, and missing a deadline is considered a system failure. Examples of hard real-time systems: airplane sensor and autopilot systems, spacecrafts and planetary rovers.

Soft real-time systems try to reach deadlines but do not fail if a deadline is missed. However, they may degrade their quality of service in such an event to improve responsiveness. Examples of soft real-time systems: audio and video delivery software for entertainment (lag is undesirable but not catastrophic).

Firm real-time systems treat information delivered/computations made after a deadline as invalid. Like soft real-time systems, they do not fail after a missed deadline, and they may degrade QoS if a deadline is missed (1). Examples of firm real-time systems: financial forecast systems, robotic assembly lines (2).

To me, it is unclear whether blocking calls are currently an issue. For micro-ROS, the main focus is on rmw and rcl, and they don’t have many of those, if any at all. I’ve seen some mutexes being used in rmw layers (e.g., FastRTPS), but since we are not using those implementations for real-time applications (right?), I’m not sure whether they are relevant.

It is a staged execution approach, inspired by the Fawkes sense-plan-act (→ SPA) pipeline. In practice, there are more than just those three stages, but SPA is the basic motivation. Basically, we assign callbacks to stages and then execute those stages sequentially.

This is supposed to be easier to use than classical priority-based scheduling for the typical robotics person who isn’t a scheduling expert (which includes me ;-), while still providing deterministic ordering.

See slide 28 from https://micro-ros.github.io/download/2019-05-07_micro-ROS.pdf for an example.

We’re currently writing this up in a bit more detail, but since it hasn’t been fully implemented yet, things are subject to change.

See the concept description for the motivation and there is also a modes example.

Most importantly, the concept description makes a connection to error handling, which is the motivation for me mentioning it here.

@Dejan_Pangercic - can you post a new thread with the meeting times? I had to search through 17 replies to find this, and I would have missed it if you hadn’t mentioned it in the TSC. I’d like to attend and include @mjeronimo and @lbegani as well. Is there a Google calendar invite we can be added to?

1 Like

OK, so I like the following target use cases:

  1. consumer robots (Ingo)
  2. warehouse logistics robots (David)
  3. mobile manipulation robots (Victor)
  4. autonomous driving robots/cars (Dejan, Geoff)

They are real and will keep us focused. The requirements of these use cases obviously differ, but there can be a lot of commonality in the hardware implementation. I’d propose that we agree on a model HW architecture that can usefully span all 4 of those domains.

All of these systems will basically consist of:

a) Sensors
b) uC

  1. uC attached to sensors (I assume that this is always required for any sensor)
  2. uC attached elsewhere in the network (e.g. for bridging one network type to another)
  3. uC attached to actuators

c) Networking infrastructure
d) Main host computer
e) Actuators

For real-time purposes we only really care about b, c, and d. So I’d propose we agree on what model HW we’d work towards. Something like:

b) uC: STM32 running NuttX
c) 1) AVB Ethernet & 2) point-to-point serial comms (could be any underlying physical layer - which physical layer is used is not relevant to real-time)
d) 1) x86 & 2) ARM

I’d suggest that we target the elements that definitely span all 4 domains in the beginning, namely b and c2. I respect @iluetkeb’s desire not to get into d in the first instance. We’ll probably get more mileage by sorting out b and c2 first, but the way I see the world, eventually we are going to need real-time in d, even for relatively low-cost applications like warehouses.

Not only the transport layer but everything above it as well: from the data link layer (OSI layer 2) up to the ROS 2 rcl (OSI layer 7), going through all the middle layers, including the network and transport layers (which we typically refer to as the networking stack), the communication middleware (e.g. DDS), etc.

The network infrastructure is key. I picked AVB because it is widely used in automotive and widely available. Also, most of its OSI layers are already real-time capable. In particular, you have to have bandwidth guarantees (possible in AVB and implied in any point-to-point protocol), without which I don’t see how we can make an RT system. If you have a sensor that transmits directly onto a broadcast network (e.g. plain vanilla Ethernet) without its own individualized bandwidth allocation, I don’t see how you can make that system RT unless you can also guarantee that it is the only thing transmitting on the parts of the network it is using.

I put serial in there, because well, you are always bound to have some kind of serial comms going on somewhere.

This doesn’t prevent us from expanding with other connectivity choices or uC later on - it just means that we’ll design it with this stuff in mind and can provide a template to build hardware for eventual testing.

@davecrawley the problem here is that if the sensor does not speak DDS (and hence we cannot use DDS’ data model flow), then we are actually solving a use-case-specific problem (e.g., is this sensor connected over AVB Ethernet or CAN, which RTOS are we running, do we have a regular network stack or TSN, …). I am not opposed to jumping on this, but let’s first decide whether we do use-case-specific or generic work.

Sure! Whatever sensor you hook up will have to connect to a uC that speaks DDS and connects to our RT middleware fabric. I’d propose we create/define a standard setup for that uC but not really get into the connection between the uC and the sensor. I think we have to assume that whatever the connection between the uC and the sensor is, it is deterministic. Most of the sensors I use hook directly into a uC that I control anyway.

It only gets messy when you want to connect a sensor that attaches to a shared communications fabric out of the box and doesn’t speak DDS or respect determinism - for example, an Ethernet LIDAR. But there is no way around it! Such a sensor will inject data with non-deterministic timing into a shared and finite communications fabric, and as such cannot be deterministic as long as it commingles with other non-deterministic data. You have to put it into our real-time middleware layer before it commingles with any other non-real-time data, which means either re-programming whatever uC is on the sensor or using a bridge. That bridge will mostly have the same setup as the standard uC discussed above. So I think we just define/agree on one standard uC setup.

AVB has the property that it can handle RT and non-RT data streams simultaneously. Obviously, we have to figure out how our middleware is going to talk to it and make sure that it allocates the right amount of bandwidth - for a sensor, for example - to ensure the guaranteed quality of service.

1 Like

I would also like to join the meeting on Monday, and would like to propose an item for the agenda that is in particular related to 6.5 (real-time pub/sub).

We have recently investigated the current ROS 2 implementation from a real-time predictability angle. In particular, we have investigated the order in which callbacks are executed. As it turns out, the current implementation has a couple of very surprising properties that result from the interplay between the executor implementation in rclcpp and the rmw API. Overall, the rclcpp executor exhibits a behavior that is somewhere in the middle between FIFO, round robin on the topics, and fixed-priority scheduling. These properties make it very hard to understand and predict the execution order of callbacks, even if the arrival order at the DDS level is known perfectly.

The behavior is described in detail in Section 3 of our paper (which is available as a preprint) and can be experimentally verified using our model validation test.

I think it would be useful to discuss (a) how best to address this issue and (b) which behavior and which ordering guarantees ROS 2 should provide.

2 Likes