Challenges with ROS2 on commercial robots

I would like to get in touch with companies and organizations that use ROS2 on commercial robots or that are developing with ROS2 for commercial robots. I am especially interested in issues that robot developers currently face when bringing a robot to market. Topics could be difficulties with (fast) system bring-up, graceful degradation, use of lifecycle nodes, heartbeats, DDS, debugging and analyzing nodes, hard real-time requirements, communication with other industrial platforms, updating systems in the field, etc. My goal is to identify common challenges that multiple parties encounter and that we potentially could address in applied research projects.

7 Likes

Sure!

Reach out to us at Nobleo Autonomous Solutions!

I was not familiar with the term “graceful degradation” but looking it up we seem to call it Limp-home mode :smiling_face_with_sunglasses: .

Really curious to see what applied research projects will develop from typical issues in commercial deployments.

3 Likes

Hello Timple,

I would love to know what issues you face as well!

We’re building absurdly simple semi-disposable systems-- not even sure I’d call them robots-- that phone data up to the cloud. That said, these also apply to my previous job building more conventional vehicles.

Two related items:

1). Reading mcap rosbags requires a full-- and version-synchronized-- workspace. That’s a pretty big limitation. What I want is to rapidly iterate everything about the system-- message types, topics, node graph, everything-- generating data the whole time and then be able to go back and view logs from 18 months ago without having to dig and figure out which repo versions to check out. If, for example, I’d stored the repo versions in the bag file I’d be in real trouble. Everything necessary to do this is in the mcap files so presumably it’s a matter of updating the rosbag2 APIs. I looked into it; not trivial, although certainly possible.

I assume Foxglove has a good answer to this and everyone who runs into it just pays for their service. It’s not horrendously expensive or anything. I haven’t submitted a PR either, so whatever.

2). Lack of optional fields in the ROS IDL. As you may know, this is a common method for dealing with schema evolution. Given that this is available in DDS I’m curious why it was omitted; perhaps some sort of safety critical / MISRA thing? We’ve worked around the issue by translating everything to protobuf before it goes off-vehicle. This works fine for the limited data our comically simple systems send, but would be a real bummer for a large system.

Not really sure what action is even possible on #2, but you asked.

I assume both of these are old complaints rehashed many times during the ROS2 design meetings, so maybe not useful for your purposes. Still, any explanations I can pass back to our Cloud People would be very helpful. Especially on Item 2.
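
To make Item 2 concrete, here is a minimal sketch of why optional fields help with schema evolution. It is plain Python, not ROS IDL or protobuf, and the message name `BatteryStatusV2`, its fields, and the `decode` helper are all hypothetical, chosen only for illustration. The idea is that a newer reader can still load records written before a field existed, and can ignore fields it does not know about.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class BatteryStatusV2:
    """Hypothetical message: 'temperature' was added in a later schema version."""
    voltage: float
    temperature: Optional[float] = None  # optional: absent in old records


def decode(cls, record: dict):
    """Build cls from a dict, ignoring unknown keys and defaulting missing optional ones."""
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in record.items() if k in known})


# A record logged before 'temperature' existed still decodes.
old = decode(BatteryStatusV2, {"voltage": 12.1})
# A record from a newer writer with an extra field also decodes.
new = decode(BatteryStatusV2, {"voltage": 11.9, "temperature": 31.5, "current": 2.0})
```

Without an optional marker, every reader and writer has to agree on exactly the same field list, which is the version-synchronization problem described above.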

Karsten Patzwaldt from NVIDIA gave a talk on just this topic the other day at the ROS-by-the-Bay meetup in Mountain View. I recommend reaching out to him. Some of the things he mentioned were lack of support for non-CPU compute, esp. GPUs and DSPs, lack of deterministic replay, and just reliability of message delivery in general.

Some good info in

1 Like

Here at Nauticus we’ve had a few sticking points, nothing deal breaking, and in no particular order:

  • Usage of callback groups and which Executor to use / num_threads is not well understood. Particularly when you have ROS callback items happening inside other ROS callback items.
  • DDS configuration, performance issues with custom messages or large arrays (e.g. point clouds). We’ve had meetings with RTI where they told us to never have more than 60 nodes (DDS participants) in a system. We went with composable nodes to remediate this and it vastly improved some issues we’ve had but the inner workings/why this is necessary is not well understood. Debugging is difficult. When I have a node running on a timer that’s supposed to publish a topic at 1 Hz and it takes too long, it’s difficult to debug.
  • ROS comms are mostly not suitable/efficient enough for our extremely high-latency needs (subsea robotics using acoustic modems) so we generally roll our own for this piece of the comms path.
  • Rosdep generally is at the mercy of the developers to choose wisely. Many just throw the kitchen sink in, resulting in large dependencies that may or may not really be needed. It also doesn’t do a great job of separating development dependencies vs runtime/release builds. Standard practice is to just install ros-humble-desktop-full or whatever when you can get by with much less. We heavily dockerize both our development and deployment of code and purely using rosdep on our codebase can result in images in the tens of gigabytes. A manual non-rosdep installation and building only the ROS code and our code we need from source slims that way down into the hundreds of megabytes (for runtime/release mode).
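
The first bullet (callbacks inside callbacks) can be illustrated with a small analogy. This is plain Python using `concurrent.futures`, not rclpy, and the function name `nested_call` is made up for the example. It only assumes that an executor with one thread cannot run a second callback while the first one is blocked waiting for it, which is the usual shape of this kind of hang.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def nested_call(num_threads: int, timeout_s: float = 0.2) -> str:
    """Run an outer callback that blocks on an inner callback on the same pool."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:

        def inner() -> str:
            return "response"

        def outer() -> str:
            # The inner work is queued on the same executor the outer callback runs on.
            future = pool.submit(inner)
            try:
                return future.result(timeout=timeout_s)
            except FutureTimeout:
                # With one thread, inner cannot start until outer returns.
                return "timed out"

        return pool.submit(outer).result()


single = nested_call(num_threads=1)  # outer starves inner
multi = nested_call(num_threads=2)   # a second thread runs inner
```

Without the timeout, the single-threaded case would block forever. In ROS 2 the analogous decisions are which executor to use, how many threads to give it, and which callback groups may run concurrently.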
2 Likes

For #2, I already filed a request here:

2 Likes

Don’t hesitate to get in touch if you want to know more about how Dexory uses ROS 2.
And for an intro: Mobile Robotics Scale-up Leveraging ROS

4 Likes

Here is our list of the main challenges we faced at cellumation:

  • Error management and system recovery.
    • As our machines are deployed in series with other industrial machines, our machines are connected to a common emergency off line, and it is normal procedure that this line is pulled low.
    • After an emergency off, an error, or a boot, the machine may not start by itself until a ‘control device’ issues a start.
    • Note, this is mandatory in the EU. See L_2023165EN.01000101.xml Section 1.2 for more information
    • Therefore we needed a supervision system to stop and restart nodes on trigger events. At the time we implemented this, we did not find any existing solution, so we implemented our own. I don’t know if something like this exists now as open source.
    • The standard lifecycle node recovery and error states were not sufficient for our use case, so we needed to diverge from the existing solution.
    • Showing condensed human readable error messages is harder than it sounds…
  • The ‘usual’ bugs while porting to ROS2…
  • Vendor agnostic communication with PLCs
  • And basically everything Dexory said in their talk…
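
The restart rule in the first bullet can be sketched as a tiny state machine. This is not cellumation's implementation, just a minimal illustration under stated assumptions: the class `Supervisor`, the state names, and the `start_nodes` / `stop_nodes` callbacks are all hypothetical, and real node management (launching processes, lifecycle transitions) is left out.

```python
from enum import Enum, auto
from typing import Callable


class State(Enum):
    WAITING_FOR_START = auto()
    RUNNING = auto()


class Supervisor:
    """Stops nodes on trigger events and restarts them only on an explicit start."""

    def __init__(self, start_nodes: Callable[[], None],
                 stop_nodes: Callable[[str], None]) -> None:
        self._start_nodes = start_nodes
        self._stop_nodes = stop_nodes
        # After boot the machine must not start by itself.
        self.state = State.WAITING_FOR_START

    def on_trigger(self, reason: str) -> None:
        """Handle an emergency off or an error: stop and wait for a start command."""
        if self.state is State.RUNNING:
            self._stop_nodes(reason)
        self.state = State.WAITING_FOR_START

    def on_start_command(self) -> bool:
        """Handle a start issued by the control device; returns True if nodes were started."""
        if self.state is State.WAITING_FOR_START:
            self._start_nodes()
            self.state = State.RUNNING
            return True
        return False


log = []
sup = Supervisor(lambda: log.append("start"), lambda r: log.append("stop:" + r))
sup.on_start_command()
sup.on_trigger("emergency_off")
```

The point of the design is that there is no path from a trigger event back to `RUNNING` except through `on_start_command`.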

If you have further questions, feel free to contact us.

1 Like