ROS2 WIFI Multicast Multi Robot and IGMP Snooping

Hi all,

tl;dr: Switching to a router with IGMP snooping stopped the packet drops caused by discovery multicast over wifi.

So recently I’ve been hitting a whole bunch of problems trying to set up ROS2 Foxy with multiple drones over wifi. In our system, a node translates the motion capture stream into a topic for each vehicle, and mavros + PX4 on the other end receive the data.

I found that if I turned on the third drone, the motion capture stream would very consistently start to drop packets. In some cases all packets would be lost for a full second, causing my drones to crash. I called this “morse coding”, as shown in the screenshot below: a live Foxglove view of the position data for the three drones.

After testing everything (motion capture output rate, number of nodes, number of topics, updating and switching to mavros2 instead of using ros1_bridge, changing the ROS receive buffer sizes), none of it made any difference. The most reliable way to reproduce the problem was to plug a drone into ethernet and then switch to wifi, which caused the drops to start immediately.

That pointed me to the router and wifi being the problem. I was already aware that multicast over wifi is problematic: multicast frames are typically transmitted at the lowest basic rate and are never acknowledged or retransmitted, so they can both flood the airtime and get silently dropped, effectively a mini DDoS (if somebody can explain this better, I would love to hear about it). But this is the first time I have definitively seen it in action. I was also aware of a feature in modern routers called IGMP snooping, where the router listens to IGMP membership reports and only forwards multicast traffic to the clients that have actually joined the group, rather than flooding it to everyone (some access points additionally convert multicast to per-client unicast frames, which do get acknowledged and retransmitted like normal traffic).
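For context on why there is so much multicast in the first place: by default, DDS participants announce themselves via SPDP discovery packets sent to a well-known multicast group (239.255.0.1), on a UDP port derived from the domain ID. A small sketch of the port arithmetic, using the default constants from the DDSI-RTPS spec (PB = 7400, DG = 250):

```python
# Well-known RTPS port arithmetic (default constants from the DDSI-RTPS spec).
PB = 7400   # port base
DG = 250    # domain ID gain

def spdp_multicast_port(domain_id: int) -> int:
    """UDP port on which SPDP discovery multicast is sent for a given domain.

    The spec's multicast discovery offset (d0) is 0 by default,
    so the port is simply PB + DG * domain_id.
    """
    return PB + DG * domain_id

# Every participant in ROS_DOMAIN_ID=0 multicasts its announcements to
# 239.255.0.1:7400 -- this is exactly the traffic IGMP snooping filters.
print(spdp_multicast_port(0))   # 7400
print(spdp_multicast_port(42))  # 17900
```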

I borrowed a router (Asus GT-AX6000) with IGMP snooping enabled, and it seems to have stopped the majority of the drops. In the following image you can see the point where I enabled snooping: the dropouts are eliminated almost completely.

I still get dropouts every now and again, but they are at most half a second and infrequent, on average one every few minutes. At least now the drones fly a lot more reliably. The ROS comms also feel more responsive (previously I sometimes had lag after pressing mission-go, for instance).

Related links: ROS2 Default Behavior (Wifi) - #40 by srushtibobade, and also this topic about standard ROS2 being harmful to networks: Unconfigured DDS considered harmful to Networks.

Hope this post helps some people out in the future :slight_smile: It was an absolute pain to diagnose, so I thought it was worth writing up!

Many Thanks,

Mickey Li


This is a very interesting topic and thank you for sharing!

Did you consider using alternative ROS2 backends like Zenoh (either with its DDS plugin or directly)?


Yes I did! I actually tried the following:

  1. Tried Zenoh around June; could not get it to work.
  2. Tried the fastdds router (whatever it’s called) and couldn’t get that working either.
  3. Sat down with the Zenoh folks over Discord for two weeks to try and get it working in my system, and still couldn’t get it going before I ran out of time and needed to continue with experiments (I’m just a lowly PhD student :') ).

I have a funny setup with a set of drones controlled by multiple ROS2 containers deployed over Kubernetes (but everything running with host networking), which made some of these techniques less straightforward than the instructions suggest.

Hi,

I am sorry to hear that you had difficulties getting Zenoh to work in your setup. We actually used Zenoh with its DDS plugin in a container environment as well (although our orchestrator was simply docker-compose). It also had a ROS1 - ROS2 bridge dropped into the mix, because “who wants to live an easy life” :slight_smile: .

In my experience, using the host network with containers has a tendency to cause problems. I would recommend trying not to rely on it.

Kind regards,
Gergely

Hi @mhl787156, I think you may be interested in Fast DDS’ Discovery Server, which is an alternative discovery protocol designed exactly to avoid those situations you are describing (as it does not use multicast). There is a tutorial on how to set it up here. Let me know if you need any help!
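For anyone reading along, the basic setup is small (a sketch based on the Fast DDS CLI as of Foxy; 192.168.1.2 is a placeholder address for the machine hosting the server):

```shell
# On one machine, start a discovery server (server ID 0, default port 11811).
fastdds discovery -i 0 -l 192.168.1.2 -p 11811

# On every robot, point ROS2 nodes at the server instead of relying on
# multicast discovery (placeholder IP; must match the server above).
export ROS_DISCOVERY_SERVER="192.168.1.2:11811"
ros2 run demo_nodes_cpp talker
```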


@mhl787156 thanks for the information, this is interesting.

a couple of questions, if i may.

  • you mentioned that you are currently using Foxy; does that mean you are using rmw_fastrtps? If so, do you set up any specific configuration for Fast-DDS via XML?
  • did you ever try rmw_cyclonedds to see if the behavior changes?
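For reference, an XML profile is also one way to avoid the multicast issue entirely. This is roughly what a Fast DDS profile disabling multicast meta-traffic looks like (a sketch based on the eProsima documentation; the peer address is a placeholder, and the file would be loaded via the FASTRTPS_DEFAULT_PROFILES_FILE environment variable):

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
    <participant profile_name="unicast_only" is_default_profile="true">
        <rtps>
            <builtin>
                <!-- Declaring an explicit unicast meta-traffic locator
                     replaces the default multicast one. -->
                <metatrafficUnicastLocatorList>
                    <locator/>
                </metatrafficUnicastLocatorList>
                <!-- Placeholder peer: the address of another machine
                     whose participants we want to discover. -->
                <initialPeersList>
                    <locator>
                        <udpv4>
                            <address>192.168.1.10</address>
                        </udpv4>
                    </locator>
                </initialPeersList>
            </builtin>
        </rtps>
    </participant>
</profiles>
```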

@kisg if you know, can you share the why and how from a technical perspective? I think the container runtime is bound via the Container Network Interface; with host networking, shouldn’t the container be able to access the physical network interface exactly as the host does?

thanks

Hi all, thanks for your responses!

@EduPonz thanks for the suggestion. I did try it about 6 months ago, but similar to Zenoh I couldn’t find a configuration that made it work. I’m not a networks guy, so I find it challenging to debug when I either get data or don’t! It might be worth trying again next time I have some time.

@tomoyafujita no worries! Yes, that’s right, I am using rmw_fastrtps (as close to the default settings as possible). I did not get a chance to try CycloneDDS, though I might give it a go now that I’ve compiled it into the containers.
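For anyone wanting to run the same comparison, switching the RMW layer on Foxy is just a package install plus an environment variable (a sketch; assumes a standard apt-based Foxy install):

```shell
# Install the CycloneDDS RMW implementation for Foxy.
sudo apt install ros-foxy-rmw-cyclonedds-cpp

# Select it for all ROS2 processes launched from this shell.
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp

# Sanity check: the report lists which middleware is active.
ros2 doctor --report
```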

I also wanted to add that I did a bit more investigation after the morse coding came back every now and again. I was timing the publish rates and the time it takes to publish the motion capture data, and found that the third publish would consistently hang, e.g.

[vicon_client-1] [INFO] [1669984724.070035094] [vicon]: Publishing clover13, freq is: 25.302858, publish average is: 0.000025
[vicon_client-1] [INFO] [1669984724.960938683] [vicon]: Publishing clover12, freq is: 25.304929, publish average is: 0.000024
[vicon_client-1] [INFO] [1669984725.423075599] [vicon]: Publishing clover14, freq is: 25.301493, publish average is: 0.063400

So I tried changing the publisher QoS from reliable to best effort, and that seemed to resolve the hanging. (Having to set the QoS on the other side was a pain, as I had to manually fork and update the Foxy ros1_bridge...) No idea if that just masks the problem or actually solves it.

It also occurred to me that the watch -n 1 ros2 topic list I was using to monitor connectivity in the network might itself have been interfering!

As was mentioned in the ros2-default-behaviour post, it’s not great that I’ve had to do this much debugging; most of my colleagues would probably have given up way before this point. In fact, a number of colleagues have tried and given up due to the learning curve. It feels like ROS2 needs a dedicated engineering team to implement properly, which is a huge challenge for those of us in university research going it alone! To me, the ROS2 defaults just don’t work across a broad enough spectrum of weird and wacky use cases compared to the ROS1 defaults, which kind of “just worked” everywhere!

Anyway, hope this is informative :slight_smile:


Hi @mhl787156,

just saw your lightning talk and your project docs. It is a great thing that you are building here. I will also check it out in more detail and will try to get Zenoh up and running on it as soon as I get a chance.

Kind regards,
Gergely