Oh rmw_zenoh, come quickly!

Many years ago I met a missionary who told me about the situation in his home country. Apparently it was so bad that every morning he would pray for the return of the Messiah: “Oh Lord, come quickly!”

In that spirit: how’s the Zenoh RMW coming along? I can’t stand any of the existing RMWs. I tried Cyclone (the default on Galactic) and it was quirky. I’m running fastrtps_cpp right now on Humble, Iron, and Rolling, with the same experience. Sometimes it takes seconds for new nodes to show up. Sometimes I need to do a ros2 daemon stop/start for no apparent reason, or restart the nodes. I can’t even configure fastrtps_cpp to go across two networks connected via VPN. That is not exactly an exotic setup, and it is something I could get working with ROS 1 in five minutes, but there is no tutorial on how to do it (only for setups on the same host, AFAICT). I asked for help on Robotics Stack Exchange almost a month ago: crickets.

I kept thinking that maybe I’m doing something stupid, but I’m not the only one struggling with this. As a maintainer of several drivers, I keep getting questions about why “my stuff” isn’t working, and I can only evasively blame it on the RMW. For example, see some of my user interactions here and here. Am I prematurely blaming the RMW for issues that are actually caused by my drivers or by user error? I’d be much more motivated to dig deeper if I didn’t have to use “operator skills” myself when working with the current RMWs.

So the question again: what’s the status of the Zenoh RMW? Is there any prototype available? I’m willing to test and write documentation!

Hi @Bernd_Pfrommer,

this is probably what you are looking for: ros2/rmw_zenoh on GitHub (RMW for ROS 2 using Zenoh as the middleware).

I talked to @clalancette this week about the Jazzy release. I hear we are getting pretty close to “feature complete” for RMW Zenoh (one or two PRs to go). However, we’re going to need the community’s help getting documentation over the line and testing the new RMW “at scale.”

I just put together the plan for the Jazzy Testing and Tutorial Party, which should be announced on or around 2024-04-16 (07:00 UTC). We’re going to need help from the community to bang on RMW Zenoh before the release. Stay tuned for updates.

In the meantime you can see if the Zenoh bridge works for your setup. It’s awesome!

Hi @Bernd_Pfrommer

You can check out Vulcanexus.org, in particular the ROS 2 Router. There are some tutorials relevant to your scenario here:

https://docs.vulcanexus.org/en/latest/rst/tutorials/cloud/cloud_tutorials.html

It is not you. The current middlewares are completely unusable for our projects (we are a team of professional mechatronics engineers, not hobbyists). We can’t even get the Fast DDS Discovery Server to work properly.

Zenoh can’t come fast enough. I occasionally work on our older ROS 1 projects, and it’s honestly refreshing how I can debug a problem without wondering whether it’s in my application code or in the underlying transport layer.

I spent more than three hours battling with the ROS 2 Router and couldn’t get it to work. I tried a lot of different things, and I consider myself fairly knowledgeable about networking and ROS 2, but I was ultimately defeated by insufficient documentation and missing tools to diagnose the problem. I had hoped to get some hints by posting on Robotics Stack Exchange, but that didn’t help either. We really need a solution that is simple to configure and easier to troubleshoot. Like ROS 1 was.

I would say it’s partly true. Our simple single-machine setup runs stably, but only with Cyclone DDS. Fast DDS has a weird but reproducible problem where services are not discovered or connected. As far as we can see, the problem becomes much worse if you use the multi-threaded executor, so we suspect a race condition in Fast DDS is the cause. We spent several weeks on this problem until we switched to Cyclone, and the problems were simply gone.

That’s exactly why I’m putting so much hope into rmw_zenoh: I got the cross-network setup to work with Connext DDS within a few minutes. It was some environment variable setting, I believe, and it just worked. But that RMW prints nasty license messages into the logs. And do I really want to switch to the RMW du jour every time I hit another snag? If that RMW doesn’t work, well, try this one? And then maybe they won’t talk to each other? This has got to stop. We need one transport that is easy to configure and works for the simple cases. It doesn’t have to scale to a swarm of 1000 robots, and it doesn’t need to be highly secure or offer whatever other advantages the current DDS implementations have over the ROS 1 transport. Frankly, I never appreciated the QoS features anyway. Oh, and please, can we have an RMW that by default does not block the publisher when the subscriber is slow to pick up messages? That ranked among my most disturbing experiences with the RMWs.
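
To make the blocking-vs-dropping distinction concrete, here is a toy model in plain Python. It is not how any RMW is implemented, and `KeepLastBuffer` and `publish_blocking` are hypothetical names made up for illustration. A writer that must wait for free queue space stalls as soon as the reader stops reading, while a keep-last buffer of depth N simply overwrites the oldest sample. In ROS 2 the nearest real knobs are the QoS history (KEEP_LAST with a depth vs. KEEP_ALL) and reliability policies; whether a given combination actually blocks a publisher depends on the RMW.

```python
import queue
from collections import deque


def publish_blocking(q: queue.Queue, msg, timeout: float) -> bool:
    """Toy model of a writer that must wait for free space in a bounded queue.

    Returns False if the queue stayed full for `timeout` seconds, i.e. the
    point where a truly blocking publisher would still be stalled.
    """
    try:
        q.put(msg, timeout=timeout)
        return True
    except queue.Full:
        return False


class KeepLastBuffer:
    """Toy model of KEEP_LAST history: never blocks, drops the oldest sample."""

    def __init__(self, depth: int):
        self._buf = deque(maxlen=depth)

    def publish(self, msg) -> None:
        # When the deque is full, append() silently evicts the oldest item.
        self._buf.append(msg)

    def take(self):
        # A slow subscriber only ever sees the newest `depth` samples.
        return self._buf.popleft() if self._buf else None


if __name__ == "__main__":
    # The subscriber is "slow": it reads nothing while five messages go out.
    latest = KeepLastBuffer(depth=1)
    for i in range(5):
        latest.publish(i)
    print("keep-last delivers:", latest.take())  # only the newest sample

    bounded = queue.Queue(maxsize=1)
    print(publish_blocking(bounded, "a", timeout=0.05))  # room available
    print(publish_blocking(bounded, "b", timeout=0.05))  # full: would stall
```

The keep-last variant trades completeness for liveness: the publisher never waits, and a slow subscriber gets stale data dropped instead of back-pressure.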

@Bernd_Pfrommer you are completely right. It is not our job to deal with the transport layer. It is as if we had to deal with the CPU voltage or any other low-level detail of the computer, when what we should be working on instead is making the robot intelligent (after all, ROS was supposed to be the layer that lets us forget about all that and concentrate on the difficult, intelligent part).

Anyway, because we suffered as much as you and got so pissed off, we decided to learn all about the subject and prepare a training course on it. In case you want to learn why your DDS communications do not work and how to fix them, check out our DDS Training for ROS robots in Barcelona next month.

We will teach how to debug those cases and find the solution, as well as how all of it works internally and relates to Linux networking. We will also cover Cyclone, Fast DDS, and of course Zenoh.

Maybe we can turn this thread into a place to share disturbing RMW experiences. Mine was porting a small node from Noetic to Humble and seeing 20x higher CPU usage.

It turns out that lots of small, high-frequency messages (aka 90% of what’s typically sent through ROS) are a worst-case scenario for all existing DDS implementations. A match made in hell.

It’s even worse in the hobbyist/academic space. I’m at the point where I no longer use ROS for small side projects that don’t need any of the large stacks; I roll MQTT instead, since it gets the job done efficiently, reliably, and with zero platform-related headaches.

The slowness with high-frequency messages (in particular with non-primitive data types) seems to be related to serialization/deserialization, which may not be RMW-dependent; if so, it’s a ROS 2-wide problem. I’ll leave that for the experts in this area to answer. Either way, it’s something I (and probably many other ROS 2 newcomers) have run into multiple times. That, and ros2 topic hz being so slow for large and/or high-frequency messages that it drops them and therefore reports too low a frequency.
It seems like people either walk away from ROS 2 or work around such shortcomings. As long as there is a tolerable workaround, I’ll put up with it. Within a robot, I load everything into one composable node. Bye-bye, RMW. But now I have to go across networks, and juggling multiple RMWs just because none of them works right is getting to me. I’m trying to convince people to use ROS 2, but it’s not an easy job when they then try ROS 2 and can’t get basic stuff to work. Students especially have very little time to build up expertise, and often nobody to talk to.
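
A toy calculation shows why dropped messages translate directly into a low rate reading. Assume (a simplification, not a claim about how ros2 topic hz is actually implemented) that a rate monitor averages the gaps between the messages it actually managed to process. If it only keeps up with every other message, every gap doubles and the reported rate halves. `estimate_rate_hz` is a hypothetical helper.

```python
def estimate_rate_hz(stamps: list[float]) -> float:
    """Rate estimate from arrival times (seconds) of the messages a monitor processed.

    Uses 1 / mean(inter-arrival time), which equals (n - 1) / (last - first).
    """
    if len(stamps) < 2:
        raise ValueError("need at least two timestamps")
    deltas = [b - a for a, b in zip(stamps, stamps[1:])]
    return 1.0 / (sum(deltas) / len(deltas))


if __name__ == "__main__":
    # Hypothetical publisher: 100 Hz for one second (101 messages).
    published = [i / 100 for i in range(101)]
    print(round(estimate_rate_hz(published), 3))  # monitor keeps up: about 100 Hz
    # A monitor too slow to keep up that only processes every other message:
    print(round(estimate_rate_hz(published[::2]), 3))  # looks like about 50 Hz
```

The publisher never changed its rate; only the monitor’s view of it did, which is why a slow rate monitor under-reports rather than failing visibly.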