Challenges with ROS2 on commercial robots

Wilco · March 25, 2025, 11:05am

I would like to get in touch with companies and organizations that use ROS2 on commercial robots or that are developing with ROS2 for commercial robots. I am especially interested in issues that robot developers currently face when bringing a robot to market. Topics could be difficulties with (fast) system bring-up, graceful degradation, use of lifecycle nodes, heartbeats, DDS, debugging and analyzing nodes, hard real-time requirements, communication with other industrial platforms, updating systems in the field, etc. My goal is to identify common challenges that multiple parties encounter and that we potentially could address in applied research projects.

Timple · March 25, 2025, 12:13pm

Sure!

Reach out to us at Nobleo Autonomous Solutions!

I was not familiar with the term “graceful degradation”, but looking it up, we seem to call it limp-home mode.

Really curious to see what applied research projects will develop from typical commercial-deployment issues.

devx · March 25, 2025, 1:07pm

Hello Timple,

I would love to know what issues you face as well!

ivaughn-spear · March 25, 2025, 2:19pm

We’re building absurdly simple semi-disposable systems-- not even sure I’d call them robots-- that phone data up to the cloud. That said, these also apply to my previous job building more conventional vehicles.

Two related items:

1). Reading mcap rosbags requires a full-- and version-synchronized-- workspace. That’s a pretty big limitation. What I want is to rapidly iterate everything about the system-- message types, topics, node graph, everything-- generating data the whole time and then be able to go back and view logs from 18 months ago without having to dig and figure out which repo versions to check out. If, for example, I’d stored the repo versions in the bag file I’d be in real trouble. Everything necessary to do this is in the mcap files so presumably it’s a matter of updating the rosbag2 APIs. I looked into it; not trivial, although certainly possible.

I assume Foxglove has a good answer to this and everyone who runs into it just pays for their service. It’s not horrendously expensive or anything. I haven’t submitted a PR either, so whatever.

2). Lack of optional fields in the ROS IDL. As you may know, this is a common method for dealing with schema evolution. Given that this is available in DDS I’m curious why it was omitted; perhaps some sort of safety critical / MISRA thing? We’ve worked around the issue by translating everything to protobuf before it goes off-vehicle. This works fine for the limited data our comically simple systems send, but would be a real bummer for a large system.

Not really sure what action is even possible on #2, but you asked.

I assume both of these are old complaints rehashed many times during the ROS2 design meetings, so maybe not useful for your purposes. Still, any explanations I can pass back to our Cloud People would be very helpful. Especially on Item 2.

chfritz · March 25, 2025, 2:56pm

Karsten Patzwaldt from NVIDIA gave a talk on just this topic the other day at the ROS-by-the-Bay meetup in Mountain View. I recommend reaching out to him. Some of the things he mentioned were lack of support for non-CPU compute, esp. GPUs and DSPs, lack of deterministic replay, and just reliability of message delivery in general.

RobotDreams · March 25, 2025, 3:15pm

Some good info in

Chuck_Claunch · March 25, 2025, 8:16pm

Here at Nauticus we’ve had a few sticking points, nothing deal breaking, and in no particular order:

- Usage of callback groups and which Executor to use / num_threads is not well understood, particularly when you have ROS callbacks happening inside other ROS callbacks.
- DDS configuration, performance issues with custom messages or large arrays (e.g. point clouds). We’ve had meetings with RTI where they told us to never have more than 60 nodes (DDS participants) in a system. We went with composable nodes to remediate this and it vastly improved some issues we’ve had, but the inner workings and why this is necessary are not well understood.
- Debugging is difficult. When I have a node running on a timer that’s supposed to publish a topic at 1Hz and it takes too long, it’s difficult to debug.
- ROS comms are mostly not suitable/efficient enough for our extremely high-latency needs (subsea robotics using acoustic modems), so we generally roll our own for this piece of the comms path.
- Rosdep generally is at the mercy of the developers to choose wisely. Many just throw the kitchen sink in, resulting in large dependencies that may or may not really be needed. It also doesn’t do a great job of separating development dependencies from runtime/release dependencies. Standard practice is to just install ros-humble-desktop-full or whatever when you can get by with much less. We heavily dockerize both our development and deployment of code, and purely using rosdep on our codebase can result in images in the tens of gigabytes. A manual non-rosdep installation, building from source only the ROS code and our own code that we need, slims that way down into the hundreds of megabytes (for runtime/release mode).
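
One way to make the "1Hz timer callback takes too long" case above easier to spot is to wrap the callback and record every run that exceeds its period. The following is only a minimal sketch in plain Python, not something from the post: `OverrunMonitor` and its fields are hypothetical names, and it measures wall-clock duration of the callback only, not executor queueing delay.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class OverrunMonitor:
    """Hypothetical helper: records callback runs that last longer than the timer period."""
    name: str
    period_s: float
    clock: Callable[[], float] = time.monotonic  # injectable so it can be tested
    overruns: List[float] = field(default_factory=list)

    def wrap(self, callback: Callable[[], None]) -> Callable[[], None]:
        """Return a callable that runs `callback` and logs it if it overruns the period."""
        def timed() -> None:
            start = self.clock()
            try:
                callback()
            finally:
                elapsed = self.clock() - start
                if elapsed > self.period_s:
                    self.overruns.append(elapsed)
                    print(f"[{self.name}] callback took {elapsed:.3f}s, "
                          f"period is {self.period_s:.3f}s")
        return timed
```

In an rclpy node, the wrapped function could presumably be handed to the timer in place of the original callback, e.g. `node.create_timer(1.0, monitor.wrap(publish_status))`, where `publish_status` is a placeholder for the real callback.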

RFRIEDM · March 25, 2025, 8:33pm

For #2, I already filed a request here:

github.com/ros2/rosidl: Feature Request: Optional Attributes (opened 05:20PM - 25 Feb 25 UTC by Ryanf55; labels: enhancement, help wanted)

DDS supports the [optional attribute in IDL 4.2 specification](https://www.omg.o…rg/spec/IDL/4.2/PDF), section 8.3.1.3. In C++, now that we have C++17, we can use std::optional. A prior discussion is here: https://discourse.ros.org/t/optional-fields-in-message/991/16

Consider adding a `@optional` specifier to rosidl that translates to the DDS optional. Amend the ROS 2 design documentation for `Standardized Annotations` to add optional: https://design.ros2.org/articles/idl_interface_definition.html

It's supported by:
* [FastDDS](https://fast-dds.docs.eprosima.com/en/latest/fastddsgen/dataTypes/dataTypes.html#optional-members)
* [RTI](https://community.rti.com/static/documentation/connext-dds/current/doc/manuals/connext_dds_professional/extensible_types_guide/extensible_types/Optional_Members.htm)
* [CycloneDDS](https://cyclonedds.io/content/guides/supported-idl.html#optional)
* [Zenoh](https://github.com/colinhacks/zod/issues/310#issuecomment-794533682)

Workarounds: Bounded array with a max length of 1, but the syntax is funky.
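
To illustrate the bounded-array workaround from the issue: a field declared in a .msg file as, say, `float64[<=1] temperature` (the field name is made up for this example) is treated as "absent" when the sequence is empty and "present" when it holds one element. A rough Python sketch of the conversion on the application side, with hypothetical helper names `to_bounded` and `from_bounded`, could look like this:

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def to_bounded(value: Optional[T]) -> List[T]:
    """Encode an optional value as a sequence of length 0 (absent) or 1 (present)."""
    return [] if value is None else [value]


def from_bounded(seq: Sequence[T]) -> Optional[T]:
    """Decode a length-0-or-1 sequence back into an optional value."""
    if len(seq) > 1:
        raise ValueError("expected at most one element in an 'optional' field")
    return None if len(seq) == 0 else seq[0]
```

Checking the length explicitly rather than relying on truthiness keeps a present value of `0.0` distinct from an absent one.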

doisyg · March 26, 2025, 9:58am

Don’t hesitate to get in touch if you want to know more about how Dexory uses ROS 2.
And for an intro: Mobile Robotics Scale-up Leveraging ROS

JM_ROS · March 28, 2025, 2:31pm

Here is our list of the main challenges we faced at cellumation:

Error management and system recovery.
- As our machines are deployed in series with other industrial machines, they are connected to a common emergency-off line, and it is normal procedure that this line is pulled low.
- After an emergency off, an error, or a boot, the machine may not start by itself; it has to wait until a ‘control device’ issues a start.
- Note, this is mandatory in the EU. See L_2023165EN.01000101.xml Section 1.2 for more information.
- Therefore we needed a supervision system to stop and restart nodes on trigger events. At the time we implemented this, we did not find any existing solution, so we implemented our own. I don’t know if something like this exists now as open source.
- The standard lifecycle node recovery and error states were not sufficient for our use case, so we needed to diverge from the existing solution.
- Showing condensed human readable error messages is harder than it sounds…
The ‘usual’ bugs while porting to ros2…
Vendor agnostic communication with PLCs
And basically everything Dexory said in their talk…

If you have further questions, feel free to contact us.

gbiggs · April 1, 2025, 11:01pm

Could you give more information on what wasn’t sufficient and what your solution provides that’s different?

JM_ROS · April 2, 2025, 9:42am

Sure, we added an additional paused state. In this state, the node keeps its state and still processes data, but it is not allowed to send any commands other than zero / no movement.

A good example of a pausable node would be a path follower, that is controlled by an action client.
We needed this in order to implement a resume behavior, after the emergency off line was pulled.

As for the error states, we added individual parametrized sub-error states to each node. These are used in two scenarios:

- Determining whether a node error results in a recoverable or a non-recoverable system state.
- Showing error messages on the HMI.

E.g. the camera driver going into a ‘Connection_Lost_Error’ would result in a message like ‘Communication to the camera X lost, please check cable’, while things that should never happen, like ‘PTP_Sync_Lost’, would result in ‘Internal error, please contact customer support’.
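
The paused state and the sub-error states described above could be sketched roughly as follows. This is an illustration in plain Python, not cellumation's implementation: the class and function names are hypothetical, the state set is reduced to three states, and marking ‘Connection_Lost_Error’ as recoverable is an assumption made for the example.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Dict, Optional


class NodeState(Enum):
    ACTIVE = auto()
    PAUSED = auto()   # keeps internal state, still processes data, commands zero
    ERROR = auto()


@dataclass(frozen=True)
class SubError:
    recoverable: bool   # can the system recover from this error?
    hmi_message: str    # condensed, human-readable text for the HMI


# Assumed classification; only the message texts come from the post.
SUB_ERRORS: Dict[str, SubError] = {
    "Connection_Lost_Error": SubError(
        True, "Communication to the camera {device} lost, please check cable"),
    "PTP_Sync_Lost": SubError(
        False, "Internal error, please contact customer support"),
}


class PausableFollower:
    """Hypothetical path follower that may only command motion while ACTIVE."""

    def __init__(self) -> None:
        self.state = NodeState.ACTIVE
        self.last_target = 0.0
        self.sub_error: Optional[str] = None

    def pause(self) -> None:
        if self.state is NodeState.ACTIVE:
            self.state = NodeState.PAUSED

    def resume(self) -> None:
        if self.state is NodeState.PAUSED:
            self.state = NodeState.ACTIVE

    def fail(self, code: str) -> None:
        self.state = NodeState.ERROR
        self.sub_error = code

    def step(self, target: float) -> float:
        """Track the target unless in ERROR; output zero unless ACTIVE."""
        if self.state is NodeState.ERROR:
            return 0.0
        self.last_target = target
        return target if self.state is NodeState.ACTIVE else 0.0


def hmi_message(code: str, device: str = "X") -> str:
    """Look up the HMI text for a sub-error code."""
    return SUB_ERRORS[code].hmi_message.format(device=device)
```

Because the follower keeps updating `last_target` while paused, resuming after the emergency-off line is released continues from current data rather than from a stale goal.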
