ROS 2 and Real-time

At one of the previous ROS 2 TSC meetings, it was suggested that we form a Working Group to analyse the current state of ROS 2 and make it real-time capable.

To date, we have the following articles about real-time in ROS 2:

  1. Original article by Jackie: https://design.ros2.org/articles/realtime_background.html
  2. ROS 2 ported to some RTOSes (https://www.esol.com/embedded/ros.html, http://blackberry.qnx.com/en/articles/what-adas-market-needs-now)
  3. Apex.AI article about porting ROS 1 applications to ROS 2 applications: https://www.apex.ai/blog/porting-algorithms-from-ros-1-to-ros-2
  4. Bosch proposing how to make Callback-group-level Executor real-time https://vimeo.com/292707644

Since real-time is not something that can start and stop within the ROS 2 “borders”, we propose analysing the entire stack, from the hardware platform to the applications written with ROS 2.

There are many details we could get lost in, but we think we could start with the following list and elaborate on the items:

  1. Pick a real-time capable hardware platform. Decide whether we go multi-core, many-core or uP.
  2. RTOS (real-time operating system). Decide whether we go POSIX or non-POSIX. Add adaptive partitioning scheduler.
  3. Create/get a BSP (board support package) and modify it, e.g. apply the PREEMPT_RT patch for Linux and configure the kernel (isolate CPUs, remove all unwanted drivers, add other applications)
  4. Explore the use of a real-time hypervisor (QNX has one)
  5. Use/create static and real-time middleware
  6. rmw, rcl, rclcpp layers:
    1. introduce safe data types (bounded, check type integrity)
    2. perform memory audit (remove unneeded memory allocations)
    3. split memory allocation in init and runtime phases, avoid memory fragmentation
    4. remove all blocking calls (or replace with timed calls, e.g. mutex vs timed_mutex)
    5. implement real-time safe log output handler (no logging to console or file)
    6. implement real-time pub/sub (either using Waitset or modified Callback/Executor)
    7. convert ros2 launch to C++
    8. run tools for static and dynamic code analysis (PC-lint, LDRA, Silexica, LTTng)
  7. Check everything above in the STL library
  8. Node architecture for deterministic execution (policy for message aggregation, nodes cohesion, parallelization, …)
  9. Global error handling (history of failures, core dumps, fail-safe mechanism, …)
  10. Real-time safety for higher level concepts, e.g.:
    • services
    • parameters
    • actions
  11. Create reference applications and porting guidelines from ROS1 to ROS2: https://www.apex.ai/blog/porting-algorithms-from-ros-1-to-ros-2
  12. Create CI for RT testing (e.g. https://github.com/ros2/ros2/issues/607#issuecomment-460319513)
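
To make item 6.5 (real-time safe log output handler) a bit more concrete, here is a minimal sketch in plain C++, not actual ROS 2 code. It assumes exactly one real-time producer thread and one non-real-time consumer thread. `LogRecord` and `RtLogQueue` are hypothetical names, and the 64-byte text limit is an arbitrary choice. The real-time side only copies into a preallocated slot; writing to console or file is left to whoever drains the queue.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>

// Fixed-size log record (hypothetical): no std::string, so copying a record
// never touches the heap.
struct LogRecord {
  int severity;
  char text[64];
};

// Single-producer/single-consumer ring buffer (hypothetical name).
// Capacity must be a power of two so indices can wrap with a bit mask.
template <std::size_t Capacity>
class RtLogQueue {
  static_assert(Capacity >= 2 && (Capacity & (Capacity - 1)) == 0,
                "Capacity must be a power of two");

public:
  // Called from the real-time thread: bounded time, no allocation, no I/O.
  // Returns false (and counts a drop) when the queue is full.
  bool push(int severity, const char * text) {
    assert(text != nullptr);
    const std::size_t head = head_.load(std::memory_order_relaxed);
    const std::size_t tail = tail_.load(std::memory_order_acquire);
    if (head - tail == Capacity) {
      dropped_.fetch_add(1, std::memory_order_relaxed);
      return false;
    }
    LogRecord & slot = slots_[head & (Capacity - 1)];
    slot.severity = severity;
    // Truncate long messages instead of growing the record.
    std::strncpy(slot.text, text, sizeof(slot.text) - 1);
    slot.text[sizeof(slot.text) - 1] = '\0';
    head_.store(head + 1, std::memory_order_release);
    return true;
  }

  // Called from a non-real-time thread, which may then log to console or file.
  // Returns false when there is nothing to drain.
  bool pop(LogRecord & out) {
    const std::size_t tail = tail_.load(std::memory_order_relaxed);
    const std::size_t head = head_.load(std::memory_order_acquire);
    if (tail == head) {
      return false;
    }
    out = slots_[tail & (Capacity - 1)];
    tail_.store(tail + 1, std::memory_order_release);
    return true;
  }

  // Number of records rejected because the queue was full.
  std::size_t dropped() const { return dropped_.load(std::memory_order_relaxed); }

private:
  std::array<LogRecord, Capacity> slots_{};
  std::atomic<std::size_t> head_{0};     // next slot to write (producer only)
  std::atomic<std::size_t> tail_{0};     // next slot to read (consumer only)
  std::atomic<std::size_t> dropped_{0};
};
```

Dropping on overflow rather than blocking is a deliberate choice in this sketch: the real-time thread never waits for the logger, and the drop counter tells you afterwards whether the queue was sized too small.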

We are requesting comments:

  1. Do you have items to be added to or removed from the above list?
  2. Do you want to join this working group? We will form a regular ROS working group that uses Discourse for discussions and holds video and in-person meetings.

We are aiming for our first video meeting next week.

D.

This looks like a massive amount of effort. I commend you for even trying, but would caution that only very few people will be able to take part in this.

At the same time, it’s lacking a clear use case. Maybe we don’t need all of that in the first shot. Without a use case, we don’t know. Without a use case we don’t have requirements. So, maybe starting with step 11 would be an idea :wink:

FWIW, I quite liked ROS1’s “just have a real-time capable publisher” approach. It’s not nearly enough in the long run, but it did get us into the kinds of systems we were interested in. It’s also a very modular approach, because it essentially bypassed the “normal” core.

As for the regular ROS2 core, I would love to see some analysis. Maybe we could convert some of your goals into code analyzer rules, and then run it over the code-base to see how many problems there are and where they are located (yes, all of us who have looked know there are quite a few, but it always helps to have some numbers to get a better grasp of the issue).

For the record, in the micro-ROS effort, we are currently targeting Cortex-M class micro-controllers with a POSIX RTOS (NuttX). Reference platforms are mostly single core.

Next week are Easter holidays for many people.

Happy to see this kicking off and willing to collaborate as pointed out previously.

I’d add the following articles, which come from a previous discussion, to the list above:

and this session in the last ROS-I conference in Europe: System Integration and Modularity in Robotics using ROS

This is the right approach. Same applies to safety and security.

This is a very troublesome topic. Before selecting any hardware platform, we should answer the question “what’s the actual (real-time) target of this WG?” Is it going to be soft real-time? Hard real-time? What are the use cases we have in mind for real-time? (This should probably come from the groups driving the work.)

Things are very different depending on where we want to go.

We’ve worked on this topic for several years already and humbly accumulated a few good (and not so good) experiences. We ended up moving to a hard real-time capable hardware platform in close collaboration with Xilinx (Intel has similar, rather good alternatives); however, I wouldn’t claim such a platform will be valid for (hard) real-time in every scenario. Obtaining (hard) real-time typically requires a good definition of the target (whether it’s a joint driver, a LIDAR driver or a motion planner within MoveIt).

There’s no silver bullet and maintaining a specific hardware platform for hard (or even soft) real-time requires a huge amount of resources.

That’s us. I’ll be giving an update today, but it should be ready in a couple of months. Again, we’re very open to including additional hardware platforms; reach out for more information on how to do so.

We’re obviously :slight_smile: in. Do you need help organizing the meetings? Happy to support and plan the events or organize the meeting notes.

Agreed, and in line with my paragraph above: we need to specify not only the level of real-time we’re aiming for but also the particular use cases this group is willing to maintain as case studies for the community.

I’m somewhat concerned about the “real-time” capabilities we may end up achieving with micro-ROS’s existing architecture. While the project is still ongoing and much work is left (it is almost halfway through, though), the architecture of DDS-XRCE requires a “bridge” to transform XRCE’s client/server (or peer-to-peer, if it ends up being implemented) communications into DDSI-RTPS ones (the ones used in regular ROS 2). Beyond communications between XRCE-native entities, communications won’t interoperate directly with the ROS 2 network, and such bridging will introduce a compromise for real-time applications. Including such a “bridge” as one of the case studies will benefit the (micro-ROS) project.

From our (Bosch CR) side, we only target the MCU itself as the real-time capable device.

Apart from pragmatic reasons (this makes it much easier), in our current products that’s all we need.

I would guess that this assumption is true for many products, because it is a very common architectural approach to keep the real-time safe parts a very small part of the overall system. Of course, AD is a notable exception, which is probably why Dejan is so interested in a more comprehensive approach :wink:

Last, but not least, we’re very interested in wireless communications and hard real-time is out with that anyway.

It depends on the implementation of the bridge. Assuming the transport layer is real-time capable (e.g., wired), the bridge itself could maintain real-time guarantees.

Very much agreed!

While I can’t speak on its behalf yet (or how deterministic it’ll be), there’s some interesting work we’ve been exploring with a local research group here in our area that’s extending TSN for wireless communications. Let me know if you’re interested and I’ll connect you.

Soft real-time guarantees probably, but hard real-time ones require the bridge itself to also be hard real-time. Not only the transport layer but everything else as well, from the data link layer (OSI layer 2) up to the ROS 2 rcl (OSI layer 7), going through all the layers in between, including the network and transport layers (which we typically refer to as the networking stack), the communication middleware (e.g. DDS), etc.

AFAIK, real-time isn’t a goal in the micro-ROS project, but I’d agree that we should indeed consider it and commit resources to it.

I just want to highlight that having a bridge, as an architectural choice, does not prevent us from achieving real-time capability. Of course, the current implementation of the xrce-dds-bridge we use in micro-ROS is a different story. A lot of work remains to be done there, on all the layers you mentioned and probably internally as well.

I would like to join this working group and contribute.

Ingo, I agree with you that this is a significant effort; however, there are many, many use cases. If you are relying on low-cost camera sensors (either 2D or 3D) for localization and obstacle avoidance, you are most likely going to be piping that data through the CPU. If you want to go faster than a crawl, you had better have some level of determinism as to when that data is going to be processed and a stop/continue decision is made. There are certain types of camera-based navigation that just don’t work unless you have deterministic timing, and that deterministic timing needs to occur between cameras and things like the IMU.

I have customers today (for example in warehousing) who want the robot to go as fast as possible (to increase throughput) but also want to spend as little on the robot as possible, hence lower-cost sensors. Now, I can easily build a fast low-cost robot, but unless I am using systems that are outside of ROS, it’s hard for me to ensure it is safe.

Even if I could get part of the system to have deterministic timing, that would be a huge boon. For example, if the safety sensors could pass messages directly to the motor controller over a deterministic data fabric (say AVB Ethernet), so that the motor controller knows it has to either slow down or stop, that would already help a lot.

I might start there, with that simple subset: a safety sensor that has an RTOS uC and a motor controller that has a similar RTOS uC, passing DDS messages with deterministic characteristics.

If we can get that to work, then we can move on to the CPU-oriented applications of camera-based navigation.

David, I’m all sold on real-time. It’s essential for many applications. I hope my comments above are not implying otherwise. Real-time is essential for predictability, which is what I think you’re aiming for.

I’m even more sold on determinism. In fact, I’m usually credited with putting it prominently on the ROS map through a talk at the ROSCon in 2017. That’s not the same as real-time, however.

When I say that those are big efforts, I’m not saying we don’t have to do this. I’m mentioning it because I care about the how.

And in particular, I think that the how depends a lot on the what for. And Dejan didn’t specify that. So, all I’m asking is that people seriously think about the what for before embarking on this journey.

At the very least, these thoughts will give us some requirements and some metrics. And then we can think about how to achieve that in the easiest way.

It will also give us some common ground to speak about.

For me personally, I care about small consumer robots. So, my answers to the stack Dejan posted are:

  1. We use micro-controllers and they are real-time capable and usually single core, some multi-core, usually homogeneous.
  2. We use a POSIX RTOS, NuttX
  3. We chose hardware that’s supported by 2. Still a pain though.
  4. Not necessary
  5. We use Micro-XRCE-DDS
  6. …
    6.1) don’t care
    6.2) got some ideas, interested in collaborating
    6.3) got some ideas, interested in collaborating
    6.4) unclear…
    6.5) don’t care
    6.6) HERE’S THE BIGGIE
    6.7) don’t care
    6.8) got it
  7. don’t care
  8. working on it, using SPA approach
  9. proposing system modes approach
  10. don’t care
  11. maybe
  12. we use a HIL approach

Now, if you ask Dejan, I’m pretty sure his answers will be very different. That’s because he has a very different application (autonomous driving). Not sure about what Victor will say, but he’s working on manipulation, which could again be quite different.

So, from my perspective, we’d like to collaborate on some of the points under bullet point 6, and that’s also the core thing for ROS2. For the rest of the things, we are very likely to have a radically different approach.

Therefore, please, before embarking on a “full stack” approach, get your use cases clear and then maybe people will know what they’re getting into and what the goals are.

Great! Now we are getting some good discussion. Ingo, yes, we’ve dived into actions and solutions before getting clear on the problem statement. We’ve also generated a list of actions that mixes development and process and could be more clearly MECE. We are missing chunks: deterministic message passing requires known allocations of bandwidth, which is available with standards like AVB but is not, so far, included on the list. We also need to talk a little bit about hardware. How about we reformulate the discussion as follows, for everyone to throw rocks at:

Ultimate goal: To enable universal robotic service

Current Situation: ROS robots cannot operate safely in many domains due to a lack of predictability and determinism, which means sensors can’t be relied on to prevent human injury

Desired Situation: ROS2 robots can guarantee safety in environments with unprotected humans by using sensors and by passing data from those sensors through the ROS2 messaging infrastructure.

Problem Statement: How can we create deterministic responses to sensor inputs in ROS2?

Problem Breakdown

Hardware

  • CPU
  • uC (I assume most everything attaches to a standard uC and is deterministic to the uC) *
  • Communications infrastructure
    • High speed (e.g. AVB Ethernet) *
    • Low speed (e.g. SLIP)

Software

  • uC RTOS (e.g. NuttX) *
    • modifications???
  • CPU RTOS (e.g. QNX derivative)
    • modifications (e.g. PREEMPT_RT patch for Linux, etc.)
    • drivers for deterministic communications (e.g. bandwidth allocation on AVB)
  • Middleware (e.g. DDS and Micro-XRCE-DDS)
  • Libraries (rmw, rcl, rclcpp)
    • Cleanup for safe implementation
      • introduce safe data types (bounded, check type integrity)
      • perform memory audit (remove unneeded memory allocations)
      • split memory allocation in init and runtime phases, avoid memory fragmentation
      • implement real-time safe log output handler (no logging to console or file)
      • remove all blocking calls (or replace with timed calls, e.g. mutex vs timed_mutex)
    • Cleanup for desirability
      • convert ros2 launch to C++
    • Implement real-time pub/sub (either using Waitset or modified Callback/Executor) *
      • Real-time pub/sub will need to request bandwidth allocation from shared comms infrastructure (e.g. AVB Ethernet)
      • Define message length standards
      • Other
  • Communications infrastructure firmware *
  • Services
    • Global error handling (history of failures, core dumps, fail-safe mechanism, …)
    • Real-time safety for higher level concepts (e.g. services, parameters, actions)

Process

I’d propose to do the bold italic items (marked with * above) first, with the initial target state being a sensor (say a sonar) passing a message to a motor controller for E-Stop.
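
As a rough illustration of that target state, here is a minimal sketch in plain C++ (not ROS 2 or DDS code) of a bounded message type and a fail-safe stop decision on the motor controller side. Everything in it is an assumption for discussion: `SonarScan`, `MotorCommand`, `decide`, the capacity of 8 ranges, and the rule that stale or empty data means stop.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical capacity: part of the type, so the message length is known at
// compile time and no heap allocation is ever needed.
constexpr std::size_t kMaxRanges = 8;

// Hypothetical fixed-length sonar message.
struct SonarScan {
  std::uint64_t stamp_ns = 0;                // acquisition time
  std::size_t count = 0;                     // number of valid entries in ranges_m
  std::array<float, kMaxRanges> ranges_m{};  // bounded payload

  // Rejects data beyond the capacity instead of growing.
  bool add_range(float range_m) {
    if (count >= kMaxRanges) {
      return false;
    }
    ranges_m[count++] = range_m;
    return true;
  }
};

enum class MotorCommand { Continue, Stop };

// Fail-safe decision (assumed policy): stop on stale data, on an empty scan,
// or when any reading is closer than stop_distance_m. NaN readings also stop,
// because the comparison below is written to fail for NaN.
MotorCommand decide(const SonarScan & scan, std::uint64_t now_ns,
                    std::uint64_t max_age_ns, float stop_distance_m) {
  assert(scan.count <= kMaxRanges);
  if (now_ns < scan.stamp_ns || now_ns - scan.stamp_ns > max_age_ns) {
    return MotorCommand::Stop;  // message missed its deadline
  }
  if (scan.count == 0) {
    return MotorCommand::Stop;  // nothing to base a decision on
  }
  for (std::size_t i = 0; i < scan.count; ++i) {
    if (!(scan.ranges_m[i] >= stop_distance_m)) {
      return MotorCommand::Stop;
    }
  }
  return MotorCommand::Continue;
}
```

The loop is bounded by `kMaxRanges`, so the worst-case execution time of `decide` is fixed, and the age check means a late message is treated the same as an obstacle.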

As the author of the Ada client library, I’m interested in this topic. Not sure I can contribute much at this time but I will in any case try to keep the Ada library in sync with developments in this area.

Yes, I’d like to join the working group and contribute.

Please count me in for teleconferences!

@LanderU and @abilbaotm, unless I’m traveling, I’ll be in the meeting, but would it be possible for you to participate as well and provide a short update on the RT build farm we’re building? I think it’s a great opportunity to get input.

I guess some of the people interested in ROS 2 and real-time are aware of the future DDS/TSN mapping, but I did not see any material related to it linked in the discussion, so I think it is worth adding the links here:

I’d also like to attend the working group meetings.

@vmayoral @iluetkeb @davecrawley et al thank you very much for your feedback and also apologies for not replying earlier.

I am with you and I agree that we need a use case. Not listing an item to decide on the use case was an oversight on my side. However I got the following use cases from your replies:

  1. consumer robots (Ingo)
  2. warehouse logistics robots (David)
  3. mobile manipulation robots (Victor)
  4. autonomous driving robots/cars (Dejan, Geoff)

If you want to plus one any of the above items or add another use case, please let me know.

Otherwise, it is clear that the above 4 use cases alone are vastly different and will have different requirements, different HW, different middleware and probably also different control and data flows (which will result in different node architectures).

That said, I guess we have 2 options from here:

  1. select one of the use cases above and probably lose the interest of people with other use cases.
  2. focus on somewhat generic parts of ROS 2 that will help independently of the selected use case.
    My reading of your comments is that these parts are the following ones:
    1. create rmw layers for static and real-time middleware (RTI Connext Micro, Micro-XRCE-DDS)
    2. perform memory audit in rmw, rcl and rclcpp (remove unneeded memory allocations)
    3. split memory allocation in init and runtime phases, avoid memory fragmentation
    4. remove all blocking calls (or replace with timed calls, e.g. mutex vs timed_mutex)
    5. implement real-time pub/sub (either using Waitset or modified Callback/Executor)
    6. integrate tools for static and dynamic code analysis (PC-lint, LDRA, Silexica, LTTng)
    7. Create node architecture for deterministic execution (policy for message aggregation, nodes cohesion, parallelization, local error handling, …)
    8. Create a design for global error handling (history of failures, core dumps, fail-safe mechanism, …)
    9. Create CI for RT testing (e.g. https://github.com/ros2/ros2/issues/607#issuecomment-460319513)

I am leaning towards the second option.
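
For the memory-related items in that list (splitting allocation into init and runtime phases, avoiding fragmentation), the general idea could look like the sketch below. It is plain C++ under my own assumptions, not how rmw, rcl or rclcpp work today: `BufferPool` is a hypothetical name, it is not thread-safe, and it does not detect every double release.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed-size buffer pool. All heap allocation happens in the
// constructor (init phase); acquire() and release() only move indices around.
class BufferPool {
public:
  BufferPool(std::size_t buffer_count, std::size_t buffer_size)
  : buffer_count_(buffer_count),
    buffer_size_(buffer_size),
    storage_(buffer_count * buffer_size)
  {
    assert(buffer_size > 0);
    // Reserve the free list up front so push_back never reallocates later.
    free_.reserve(buffer_count);
    for (std::size_t i = 0; i < buffer_count; ++i) {
      free_.push_back(i);
    }
  }

  // Runtime phase: O(1), no allocation. Returns nullptr when exhausted
  // instead of falling back to the heap.
  unsigned char * acquire() {
    if (free_.empty()) {
      return nullptr;
    }
    const std::size_t index = free_.back();
    free_.pop_back();
    return storage_.data() + index * buffer_size_;
  }

  // Runtime phase: returns a buffer previously handed out by acquire().
  void release(unsigned char * buffer) {
    assert(buffer >= storage_.data() &&
           buffer < storage_.data() + storage_.size());
    const std::size_t offset = static_cast<std::size_t>(buffer - storage_.data());
    assert(offset % buffer_size_ == 0);   // must point at a buffer start
    assert(free_.size() < buffer_count_); // more releases than acquires is a bug
    free_.push_back(offset / buffer_size_);
  }

  std::size_t available() const { return free_.size(); }

private:
  std::size_t buffer_count_;
  std::size_t buffer_size_;
  std::vector<unsigned char> storage_;  // one contiguous block, never resized
  std::vector<std::size_t> free_;       // indices of currently free buffers
};
```

Because every buffer has the same size and lives in one contiguous block, the pool cannot fragment, and running out of buffers becomes an explicit error the caller has to handle rather than a hidden allocation.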

I’d like to invite you to a meeting next week to decide on the above and to kick off the work. Could you guys meet on

Note that we have interest from across the globe, so finding a meeting time that is friendly to every time zone will be impossible.

That starting time is very Europe-friendly, so yes, it works for me :slight_smile:

How much time do you expect the meeting would take? More than 90 minutes?

That time works for me.