ROS2 Default Behavior (Wifi)

tl;dr: ROS2-Wifi default setup is not great and gives a bad experience to users without a strong DDS/networking background.

Hi all,

If you don’t know me, I play around with mobile robots over in Navigation2 and related projects. This is a little long-winded because I wanted to be thorough, so feel free to skim. The majority of folks involved in ROS2 have been jacks of all trades, ROS-specific developers, or large companies with resources. As such, it’s been less than ideal but relatively unavoidable that a robotics developer has had to wander deep into the forbidden forest that is DDS to get things to work in ROS2, many times without a guide. Many large companies have this expertise in house and the jack of all trades welcomes the challenge, but this isn’t scalable, nor is it a way to offer an enticing out-of-box experience to bring people into ROS2. As we enter Foxy, the next LTS release, and new development is increasingly gearing towards ROS2 (as evidenced by ROSCon ROS2 talks continuing to climb), I think it’s an appropriate juncture to talk about some default behaviors for the average user. This post is about Wifi, but I suspect others will bring up complementary pain points in the comments below.

Everyone has a different view of what an “average user” is, so I want to outline mine here, based on my anecdotal experience:

  • Some robotics experience, whether it is an undergraduate course or a masters program
  • Range of software experience, from basic python scripting to professional C++ development
  • Hired/researching planning, cloud, applications, sys admin. Usually some level of Linux-level networking. Being able to: nmcli, ifconfig, setup static IPs, and easy-RSA3 puts you in the top 10%.

To this profile of an average user, ROS should “just work”. You make a publisher for your work, your robot subscribes to it, you install open-source driver packages to make your robot run. If you’re a mobile robotics person or want ROS2 networked across multiple machines over Wifi, you’re now in for a world of hurt. You’ve heard about this thing called DDS and you think it’s pretty neat in concept, but now you find that you’re not even getting the 15 fps camera feed in rviz on corporate wifi that you easily got in ROS1, and you need to fix this by Friday.

It has become evident to me recently that this is a serious problem, through a string of emails from exasperated users as ROS2 has been picking up steam from notions that ‘ROS2 is ready’. I knew this was a learning hurdle, but it didn’t occur to me how challenging the average user would find setting this up, or the things they’d try in order to work around it. Managing the Navigation2 project and communicating with groups using mobile robotics technology has given me a taste of their experiences trying to get a basic demo running over wifi. I have even found groups in large corporations questioning whether ROS2 is mature enough to be used on any product at all. Having to go down to the DDS layer is not intuitive to users; they expect to be able to “do this through ROS2” in some fashion. I think that’s simply an idea that needs to be dispelled, but it gives you an insight into how this looks to people who haven’t done the dive to “know better”.

Wifi seems to be treated as an afterthought or a special edge case, rather than the default. Much of the limited documentation you find on this topic blames the router (which doesn’t inspire confidence) or provides a long-winded explanation of what’s going on in domain-specific jargon. From the perspective of the average user I describe above: “I don’t care, how fix?”. We should be giving them some configuration to use that “just works”, but moreover, I think support for this should be the default configuration such that the average user never needs to know there’s any other way. If you’re building a product, at some point someone needs to optimize the network. That person isn’t the developer working on robotics demos or development who currently needs to overcome this struggle in order to complete a basic and unrelated task.

This isn’t to say that I think that all robotics applications require this. I venture to assume the industrial manipulator crowd may be less interested in this and autonomous driving doesn’t require Wifi support at all. The question I pose is: Are the DDS vendor’s settings to work on Wifi degrading average performance so poorly for those use-cases that they can’t make basic demos? If not, which I suspect, then this would be an excellent middle-ground solution to make the configurations for Wifi default behavior on all Tier-1 DDS vendors / RMWs. I believe it should be our goal that > 2 sigmas worth of users shouldn’t need to know or care about DDS for their day to day development. At the moment, that’s closer to -1 sigma.
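For reference, the sigma figures can be turned into rough user fractions. This is a minimal sketch, assuming user expertise is normally distributed and reading each sigma value as a one-sided cutoff; both are assumptions made for illustration, not measurements.

```python
import math

def fraction_below(z: float) -> float:
    """Standard normal CDF: fraction of a normal population below z sigmas."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

goal = fraction_below(2.0)    # share of users who should not need to touch DDS
today = fraction_below(-1.0)  # rough share who can avoid it at the moment

print(f"goal:  {goal:.1%} of users")   # goal:  97.7% of users
print(f"today: {today:.1%} of users")  # today: 15.9% of users
```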


Actually, taking this a bit broader and as someone who has gotten the same questions many times, starting in 2012 with ROS 1 (“This is crap, it doesn’t work, I cannot create a 10 kHz current-control loop distributed over a Windows machine, a Mac and your Linux. Isn’t that what this R-O-S is supposed to do?”), I support this.

In the end, after explaining sufficiently, all engineers are willing to understand there are limits to technology and that things may not work out-of-the-box immediately.

But it does not look good when you have to get all apologetic about a technology stack. Especially not one which is geared towards use in production environments (ROS 2 == ROS-for-Products).

And, more often than not, there is no one around with sufficient in-depth knowledge to explain why something isn’t working, and how to change it such that it will work. Finding a configuration which can improve the out-of-the-box experience on a set of use-cases which are common would be tremendously valuable.

As in real-life, first impressions count.

If a single configuration isn’t feasible, perhaps a set of profiles could be introduced? “Switch to the wifi/lossy-network profile if you’re not using a cabled network”.

But that’s all assuming wifi actually needs special configuration, and it’s not something else which is affecting user experiences here.

Are you referring to ros2/rmw_fastrtps#315 among others? That wasn’t specifically blamed on the router I believe. It was more a combination of factors (but the router was among those). I believe the FastRTPS documentation was extended with some more info on use over wifi as a result of diagnosing those issues: Typical Use-Cases: Fast-RTPS over WIFI. I could imagine other vendors have similar documentation.


Thank god. When I saw you typing my brain went to “I just opened a can of worms that’s going to make me look reeeeal dumb” (which I’m OK with on occasion or I wouldn’t be publicly posting things like this to open myself up for it).

No, not that in particular. I don’t want to fish out quotes from specific places because that’s not the point. I’m not trying to call out any specific group, person, or comment in particular, but the body of documentation that exists has some common themes regardless of vendor or company. I don’t want to get into pedantic arguments over a thematic topic. It doesn’t seem anyone built a DDS implementation with Wifi as a first-class citizen, or at least not by default. As a result, while ROS2 needs Wifi to be a first-class citizen, it’s taken a back seat and no one’s making a stink about it yet. Some have done better jobs than others in supporting it after the fact, but their fixes aren’t on by default in ROS2. ROS2 needs to support Wifi well, plain and simple, regardless of the technical challenges of implementing that in DDS. And to be more to the point, any DDS vendor that fails to do this will lose the entire mobile robotics market instantaneously.

The answer for some vendors might just be to make their default XML files include the flags required to get it working on wifi by default. The answer for other vendors might be reworking their systems to support it sufficiently. But my proposition from that post is that Wifi needs to work out of the box, without excuses or asterisks.

So just to make clear: I don’t believe our use-cases or requirements are very special.

I cannot imagine that a DDS implementation capable of running an entire theatre-of-war operations management application or an implementation that routes between the ISS and the mobile-manipulation-in-space lab around the corner from me cannot deal with (corporate) wifi and a few lost packets.

To make this actionable: would you have examples (preferably: detailed accounts) of what worked, what didn’t, what was tried and perhaps also how to reproduce those circumstances?

Second: there must have been some discussion about use of DDS over lossy networks with DDS vendors. Either by OR, corporate users of specific implementations, etc. I’m hoping people with experiences will contribute to your thread.

It would be great if we could re-use experiences and experiments from other user(s) (groups), as that would potentially save us a lot of work.

The fora of RTI, Adlink, Twin Oaks et al. are full of Q&A about this sort of topic.

@smac: have you reached out to any vendors about this already?


Edit:

I at least try to always be reasonable and respectful in my comments, here and on ROS Answers. Your comment makes me wonder whether I should try harder.

Not off hand or anecdotally. I think this is more of a thematic question: we agree that “Yes, we want to make good Wifi support default behavior in ROS2”, and then the experts in this area (e.g. vendors, OR, etc.) make the necessary adjustments to enable it, whether through configuration or feature development. In terms of what doesn’t work: the current defaults, or requiring a complete pre-determined list of ports (unless ports are identical and scalable between runs, reboots, and multiple machines). In terms of what works “best” for ROS2 or per vendor, I leave that to the experts. I won’t pretend to be one.

I’ve talked with ADLINK and eProsima and learned that the default behavior basically doesn’t support it and you need to go into the XML to make some modifications. Their solutions are very different. I don’t think there must be some consistent solution between vendors, but by my proposition, whatever their default_[vendor]_rmw_config.xml file is, it must contain the solution to work reasonably on Wifi with no other modifications, whereas the current default configuration does not.

You are, we just disagree on occasion or you have another vantage point. Your alignment on this indicates to me that I probably hit on something that affects a larger crowd than I expected.

Take this with a grain of salt as I am not actively involved with DDS testing.

Would it be productive to create some sort of WiFi/DDS burn-in tests that motivate vendors to create adequate configuration profiles? For example, we could have a test, bound to physical hardware, that moves a large bag of data from a remote machine to a test machine over an actual wifi connection (perhaps with real or simulated traffic). Each test could run some number of times and aggregate statistics could be recorded. I could see different configurations being reasonably easy to set up if we ever get back in the office.
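To make the “aggregate statistics” part concrete, here is a minimal sketch of how repeated transfer timings might be summarized. The function name, the bag size, and the timings are all hypothetical placeholders, not results from any real test.

```python
import statistics

def summarize(transfer_seconds, bag_megabytes):
    """Aggregate repeated transfer timings into throughput statistics (MB/s)."""
    throughputs = [bag_megabytes / t for t in transfer_seconds]
    return {
        "runs": len(throughputs),
        "mean_mb_s": statistics.mean(throughputs),
        # stdev needs at least two samples
        "stdev_mb_s": statistics.stdev(throughputs) if len(throughputs) > 1 else 0.0,
        "worst_mb_s": min(throughputs),
    }

# Hypothetical timings (seconds) for moving a 500 MB bag five times.
example = summarize([50.0, 62.5, 40.0, 100.0, 50.0], bag_megabytes=500.0)
print(example)
```

Recording the worst run alongside the mean keeps a single bad transfer visible instead of letting it average out.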


While I’m always a proponent of automated testing, a word of caution: it would be easy to make a test that encourages vendors to optimize the wrong thing.


Well, without some representative scenarios, or at least characteristics of the network types and configurations which you describe, it’s going to be a bit difficult to know when it’s actually been fixed.

It sounds like there are two sides to this:

  1. figuring out the appropriate parameters for DDS over lossy networks (with potentially long delays)
  2. providing an easy/ier way for ROS 2 users to activate such a configuration
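As a rough illustration of the second point, a “profile” could be little more than a set of environment variables pointing each RMW at a vendor XML file. This is a sketch under assumptions: the profile table and the XML file paths are hypothetical, while `RMW_IMPLEMENTATION`, `FASTRTPS_DEFAULT_PROFILES_FILE` and `CYCLONEDDS_URI` are the environment variables the respective layers read.

```python
import os

# Hypothetical profile table: the XML paths are placeholders, not files
# shipped by any vendor.
PROFILES = {
    ("rmw_fastrtps_cpp", "wifi"): {
        "FASTRTPS_DEFAULT_PROFILES_FILE": "/etc/ros2/fastrtps_wifi.xml",
    },
    ("rmw_cyclonedds_cpp", "wifi"): {
        "CYCLONEDDS_URI": "file:///etc/ros2/cyclonedds_wifi.xml",
    },
}

def environment_for(rmw, network, base_env=None):
    """Return a copy of the environment with the RMW and profile variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["RMW_IMPLEMENTATION"] = rmw
    # Unknown (rmw, network) pairs fall back to the vendor defaults.
    env.update(PROFILES.get((rmw, network), {}))
    return env
```

The returned dictionary could then be passed as `env=` to `subprocess.run` when starting a node, so switching between a wired and a wifi setup is a one-word change.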

Personally I wouldn’t necessarily want the default configuration to work for everything. I’d be OK with the default being for wired networks, but at least have a clear and well known path for those using wireless / lossy networks to set things up to work for them as well.


Sounds like a WG in the making? Or would there be a WG which could take this on already?

Hi guys,

This is a “hot” topic, and we have been working during the last 2 months to benchmark the behavior of ROS2/DDS on WIFI scenarios.

In order to do that, we have this cool raspberry pi farm fully dedicated to our tests:

30 Raspberry pi connected to wifi networks, using different routers, and running different cases over wireless and wire setups, to benchmark everything. In a single Raspberry pi we can have multiple ROS2/DDS participants, with many publishers and subscribers.

Everything is automated, and we can run any test just with a single command. We are preparing an article with the conclusions.

In short: It is true that many wifi scenarios have a lot of packet loss, especially when using multicast, and this is more noticeable when you have many nodes on the same wifi network. But in simple wifi scenarios, ROS2 works out of the box.

We have written some tutorials on how to setup Fast RTPS when you notice such problems as an initial guide:

Fast RTPS typical use cases

and we have already some good data, in complex scenarios, for example, 30 Raspberry PIs, 1 Participant per Raspberry PI, with 5 publishers and 5 subscribers, using discovery server (best solution when you have many nodes and a lossy network)

In that specific case, the start-up time of the system is around 3 seconds, a good number in these scenarios.

We were waiting to have a full set of results with a comprehensive set of scenarios, to write a detailed article, but as this is a hot topic, we will write several articles while we complete the work, and we will update our configuration guides.


@Jaime_Martin_Losa

thanks for the information, that is really good to know.
one thing to confirm, I believe that those are Raspberry pi 4, right?

Hi @tomoyafujita

right

@Jaime_Martin_Losa: have you considered using something like tc to simulate adverse networking connections?

That would perhaps remove the need for real hw. Although I really like using real hw for this, it doesn’t facilitate replication of your results and tests by others.

It could also ease reusing the benchmark setup you have there and see how other DDS implementations cope.
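For anyone who wants to try this, a minimal sketch of building a `tc` netem command from Python follows. The interface name and the loss and delay values are arbitrary assumptions, and the helper only builds the command; actually running it requires root on a Linux machine.

```python
def netem_command(interface, loss_percent, delay_ms, jitter_ms=0):
    """Build (not run) a `tc` command that adds loss and delay to an interface."""
    cmd = ["tc", "qdisc", "add", "dev", interface, "root", "netem",
           "loss", f"{loss_percent}%", "delay", f"{delay_ms}ms"]
    if jitter_ms:
        # netem takes an optional jitter value directly after the delay
        cmd.append(f"{jitter_ms}ms")
    return cmd

# "wlan0" and the numbers are placeholders for illustration.
print(" ".join(netem_command("wlan0", loss_percent=5, delay_ms=40, jitter_ms=10)))
# tc qdisc add dev wlan0 root netem loss 5% delay 40ms 10ms
```

The rule can be removed again with `tc qdisc del dev wlan0 root`.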

@gavanderhoorn

To simulate adverse networking connections you first need to know what happens in a real scenario, and we are seeing all kinds of surprises in this specific setup. For example, it depends a lot on the specific router, whether you have both wired and wireless connections, the distance to the router, etc. When you simulate the scenario, everything is perfect, including the packet loss.

That is a cool pi farm.

And is multicast on as part of the Fast-RTPS “default” configuration for ROS2 users, resulting in poor wifi performance of the ROS2 system out of the box? If so, that’s really what I’m trying to communicate in this post. I’m not trying to argue that one DDS implementation or another can work on Wifi; I’m trying to propose that all DDS implementations should come with configuration files / defaults on install such that they work out of the box to a reasonable degree in practical Wifi situations. We know this is a problem, by your quote here. There seem to be many solutions, but rather than describing the solutions, I would like to start the discussion about having these enabled by default, so that most users don’t ever need to worry about it.

I repose the question from my original post:

My assertion is that we should make whatever changes each Tier-1 vendor deem appropriate (new defaults, ros2 configuration file, some development, whatever, I’m not trying to assert the method at which folks accomplish this goal) to enable good, reliable, consistent operations on wifi out of the box for users without having to know, understand, or care about DDS prematurely. Nothing screams immaturity of a technology like a basic demo failing on corporate wifi.


Now, if we’re going to get specific on tests, which is not the purpose of this post, let’s at least make a real test that brings up the issues.

A typical system for mobile robotics is going to be ~50 nodes (and for now that’s 50 participants) per computer, each with 1-3 publishers and subscribers. Of those 50-150 data streams flying around, I’m going to throw out a number like 20-30% of those topics representing images (1280x720) or pointclouds (407,040 points for a default config D435 camera), with some big data at 30 Hz. In terms of the number of actual computers, that can be lowered to 3-4, but add in traffic from 20-200 other non-ROS, unrelated devices running on the same network.
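To put rough numbers on those streams, here is a small back-of-the-envelope calculation. The message sizes are assumptions (uncompressed RGB8 images at 3 bytes per pixel, and 16 bytes per point for the cloud); real message layouts and any compression will change the result.

```python
def stream_mbit_per_s(items_per_msg, bytes_per_item, rate_hz):
    """Raw payload rate of one topic in megabits per second (no headers)."""
    return items_per_msg * bytes_per_item * rate_hz * 8 / 1e6

# Assumptions: 3 bytes per pixel (RGB8) and 16 bytes per point
# (x, y, z as float32 plus 4 bytes of colour or padding), uncompressed.
image = stream_mbit_per_s(1280 * 720, 3, 30)
cloud = stream_mbit_per_s(407_040, 16, 30)

print(f"one 1280x720 image stream: {image:.0f} Mbit/s")  # 664 Mbit/s
print(f"one 407,040-point cloud:   {cloud:.0f} Mbit/s")  # 1563 Mbit/s
```

Under these assumptions a single uncompressed stream is already in the hundreds of Mbit/s, before any discovery traffic or other devices on the network are counted.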

But your test is useful in getting started! You say when using the discovery server, is that enabled by default for users to use without having to know, care, understand, or otherwise worry about it to develop?

I’m more concerned with the on-going multicast traffic which is bogging down the entire system and creating havoc on people’s systems in practical Wifi situations steady state. Simple wifi situations aren’t good enough here - we’re not working on a theoretical robotics framework, we’re working on a production-ready one.

I can tell you, anonymized from my inbox, that Fast-RTPS does not meet the specifications of the mobile robotics community out of the box. This isn’t something I can speak about too intelligently, but I’m sure @rotu can add more color to this analysis. I have no issues running Fast-RTPS locally, but even on my home wifi network I’ve hit similar limitations when trying to help debug issues from users.

Hi @smac,

The advantage of the default DDS discovery mechanism is that it works out of the box with no configuration, and it works in many situations, including wifi for simple cases. When you have a router that is not handling multicast properly (not all routers exhibit this behavior), then I recommend using alternative mechanisms, such as providing a list of peers, or using a discovery server (also unicast). This will always require some configuration as it depends on your network topology, but it is very easy to set up.

What you are describing, 50 mobile robots, transmitting HD images and point clouds over a saturated wifi, I would not say is a simple hello world robotic application, and some fine tuning for problematic wifi routers could be necessary.

So, to recap, we have 3 options for discovery:

1.- Default DDS Discovery Mechanism using multicast (default for all DDS implementations): No required configuration, works well in many cases, and the current behavior is optimized for lossy wifi networks.

2.- Default DDS Discovery Mechanism using unicast: Requires providing a list of unicast peers (IP addresses).

3.- Discovery Server: Scales a lot better, as it is not required to have a conversation between every pair of peers, as in the default discovery. Supports redundant discovery servers. Requires providing the IP address of the discovery server.
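The scaling difference between the default discovery and a discovery server can be sketched with a simple count. This is a simplification that only counts participant-level conversations and ignores per-endpoint traffic; the participant counts are the 30-participant Raspberry Pi case and a 3-computer, 50-nodes-each robot.

```python
def pairwise_conversations(participants):
    """Default discovery: every pair of peers talks, n * (n - 1) / 2."""
    return participants * (participants - 1) // 2

def server_conversations(participants, servers=1):
    """Discovery server: each participant only talks to the server(s)."""
    return participants * servers

for n in (30, 150):
    print(n, pairwise_conversations(n), server_conversations(n))
# 30 435 30
# 150 11175 150
```

The pairwise count grows quadratically with the number of participants, while the server count grows linearly.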

Cases 2 and 3 are set up through a very simple XML file. What would you recommend to simplify the configuration further? Do you think it is better to make the discovery server mechanism the default?

The user traffic is by default unicast. If you have a lossy network shared by many nodes, it is possible to reach problematic situations way before you reach the theoretical available bandwidth.

With the study we are conducting, we will provide information about different cases, not only for the discovery phase, but also for the user traffic in wide deployments, giving you some general recommendations on how to proceed in such cases.

If you recommend using a discovery server why is that unable to work out of the box? I think you’re missing or ignoring the point. I’m asking a thematic question about whether we as a ROS community value good non-theoretically-ideal wifi out-of-the-box support even if it degrades slightly the performance in other aspects.

That sentiment is the problem I’m trying to solve. From a roboticist’s perspective, I find this an unacceptable solution. You cannot ask a new user to configure or optimize their network topology to get a simple ROS2 demo working with their robot. If I buy a Fetch robot from Fetch Robotics, stick it in my cubicle at work, build a ROS2 application and can’t get an HD image stream and a couple of pointclouds from the robot to my computer in rviz, that’s a critically impaired process. I find it acceptable to need to optimize performance at the point where you are deploying it in some massive industrialized setting. What you suggest there defines the following workflow for the “average user” I define above, with the skill sets I describe above:

  • Learn ROS2 API
  • Get some robot to play with
  • Build their custom application
  • Run it and fail
  • Immediately hit the wall with ROS and then have to spend a week or more learning enough about DDS to know what dials to turn, or learning enough about networking to have a nuanced understanding of multicast, routers, unicast, discovery servers, etc.
  • Or they give up and spin their own UDP socket or regress back to ROS1, which is exactly what I would do in that situation.

Working out of the box in corporate wifi in steady state setting reasonably well is the minimum viable product of ROS2 that needs to be offered out of the box, in my opinion. This is what I’m referring to in the original post about treating wifi as a special case. This capability should be baked in on install, and if the solution a specific vendor implements doesn’t allow for that, perhaps the requirements that derived that solution should be reconsidered.

I’m describing a situation with 1 robot, with 50 nodes on 3 computers, which is a hello world case in robotics, perhaps not in networking, but we’re not a networking community, we’re a robotics community. Navigation uses 19 nodes by itself. Add sensor drivers, hardware drivers, and user-space applications, and you hit 40-50 very quickly. There’s going to be a bunch of data flying around; HD images and 3D pointclouds are not exceptions to our needs, they’re the definition of them.

I really didn’t want to make this conversation pointed towards one vendor or another, but this is simply not true. That may be the case for your idealized network with a few nodes, a router sitting right next to them and no other traffic, but it is not the experience shared by multiple groups within Samsung, Rover, multiple groups in/around Intel, and myself. If we want to get granular, it seems Cyclone’s SPDP option fixes this to a reasonable degree without any additional configuration / hardcoding of addresses. For them, my original post’s intent would just be having that enabled by default or providing that in a default configuration file with ROS2 installs.

My goal still isn’t to talk about any particular vendor or any particular vendor’s solution to this problem, but in getting overall agreement that “yes, having good wifi support out of the box is something we find very important” or “no, having good wifi support isn’t worth it, we should instead provide accessible documentation”.

I’d be interested to hear @joespeed @tfoote @rotu jump in on their thoughts. But really, anyone that has an opinion about if they find this to be a problem for their needs or not to get some more outside input.


Hi @smac,

In Fast RTPS, you use multicast just for participant discovery (so an “SPDP option” is not required; EDP always uses unicast), and we have very defensive defaults for this kind of network. After Intel and others reported some problems over wifi on navigation, we reacted by modifying the discovery behaviour and changing the default parameters some months ago, so that is already in place.

With those settings (now the default), our performance is really good. See this video from @rotu:

This is the case you are talking about, I think.

When you need to improve this behaviour even more, because multicast is not an option in your network, then you should use the discovery server or unicast discovery.

The Discovery server is included in Fast RTPS, any node can be a discovery server, you do not need anything more. And we can find easy configuration options. Right now, it is as simple as this:

https://fast-rtps.docs.eprosima.com/en/latest/use-cases.html#udpv4-example-setup

But we can even go further. I have no particular problem in making this discovery mechanism the default.


Awesome, if other users in this thread come back and we have some agreement that better wifi support is something we want out of the box, let’s make that the default configuration for your DDS then. In the meantime, let’s stop on the specifics of your product and let users and others express their thoughts on the topic I actually addressed in my post.

This is not an affront to Fast RTPS, nor am I advocating for Cyclone or really caring much at all about the implementation details under the hood of the ROS2 RMW. I’m looking at evidence-based conclusions from external users who contacted me while trying to get navigation stack stuff working, and who are having issues deploying mobile robotics applications using ROS2 and DDS. Clearly, there is some issue or disconnect. I don’t know what it is and, for the purposes of this discussion, it’s implementation details. The question at hand is more thematic: are we willing to make trade-offs to support this, and is this as big of a problem as I feel it is?

I agree quite strongly with this. While any engineer knows that when you move to production you will need to spend time optimising and making things reliable, in a typical office using the office wifi is where prototyping most often begins, and that is when many foundational technology decisions are made, including “Should I use ROS 1, ROS 2, or my own solution?”.


It seems you already have some ideas – or at least have talked to some vendors about this @smac

Why don’t we make this more actionable and just list the options/approaches/settings that have worked/should work for the types of networks you have encountered difficulties with?

I don’t believe anyone here is going to say: “no, I don’t want wifi to work, I’d rather spend 3 weeks reading DDS documentation”.

So let’s get to it and start working on solutions.

So far, I’ve seen mentioned:

  • SPDP for Cyclone
  • Discovery server / unicast discovery for FastRTPS

Without listing these, there doesn’t seem to be a way to discuss whether they should be “the default”, as we’d be drawing a conclusion not based on technical facts, but based on a desire. And without knowing which options there are, we cannot determine whether they would also work sufficiently for non-wifi setups (as I would not know what to turn on/off).

There is a reason these are not the defaults in many DDS implementations, and it’s likely it is a good one. Whether it is a good one in the scenarios you’ve described remains to be seen.