30 Raspberry Pis connected to wifi networks, using different routers, and running different test cases over both wireless and wired setups, to benchmark everything. On a single Raspberry Pi we can have multiple ROS2/DDS participants, with many publishers and subscribers.
Everything is automated, and we can run any test with a single command. We are preparing an article with the conclusions.
In short: It is true that many wifi scenarios have a lot of packet loss, especially when using multicast, and this is more noticeable when you have many nodes on the same wifi network. But in simple wifi scenarios, ROS2 works out of the box.
We have written some tutorials, as an initial guide, on how to set up Fast RTPS when you notice such problems:
and we already have some good data for complex scenarios. For example: 30 Raspberry Pis, 1 participant per Raspberry Pi, each with 5 publishers and 5 subscribers, using the discovery server (the best solution when you have many nodes and a lossy network).
In that specific case, the start-up time of the system is around 3 seconds, which is a good number in these scenarios.
We were waiting until we had a full set of results, covering a comprehensive set of scenarios, before writing a detailed article; but as this is a hot topic, we will publish several articles while we complete the work, and we will update our configuration guides.
To simulate adverse network connections you first need to know what happens in a real scenario, and we are seeing all kinds of surprises in this specific setup. For example, behavior depends a lot on the specific router, on whether you have both wired and wireless connections, on the distance to the router, etc. When you simulate the scenario, everything is perfect, including the packet loss.
And is multicast, on as part of the Fast-RTPS “default” configuration for ROS2 users, resulting in poor wifi performance of the ROS2 system out of the box? If so, that’s really what I’m trying to communicate in this post. I’m not trying to argue that one DDS implementation or another can work on wifi; I’m trying to propose that all DDS implementations should come with configuration files / defaults on install such that they work out of the box to a reasonable degree in practical wifi situations. We know this is a problem, by your quote here. There seem to be many solutions, but rather than describing the solutions, I would like to start the discussion about enabling them by default, so that most users don’t ever need to worry about it.
I pose the question from my original post again:
My assertion is that we should make whatever changes each Tier-1 vendor deems appropriate (new defaults, a ros2 configuration file, some development, whatever; I’m not trying to dictate the method by which folks accomplish this goal) to enable good, reliable, consistent operation on wifi out of the box, for users, without their having to know, understand, or care about DDS prematurely. Nothing screams immaturity of a technology like a basic demo failing on corporate wifi.
Now, if we’re going to get specific on tests, which is not the purpose of this post, let’s at least make a real test that brings up the issues.
A typical system for mobile robotics is going to be ~50 nodes (and for now that’s 50 participants) per computer, each with 1-3 publishers and subscribers. Of those 50-150 data streams flying around, I’m going to throw out a number like 20-30% of those topics representing images (1280x720) or pointclouds (407,040 points for a default-config D435 camera), i.e. some big data at 30 Hz. In terms of the number of actual computers, that can be lowered to 3-4, but add in traffic from 20-200 other non-ROS, non-related devices running on the same network.
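To put rough numbers on those streams, here is a quick back-of-the-envelope calculation (my own, not from the post above; it assumes uncompressed RGB8 images and ~16 bytes per point, while real ROS2 transports may compress or downsample) showing why even a few such topics stress a wifi link:

```python
# Back-of-the-envelope bandwidth for the streams described above.
# Assumptions (mine, not the original poster's): raw RGB8 images
# (3 bytes/pixel) and 16 bytes per point (x, y, z, rgb as 32-bit floats).

def mbps(bytes_per_frame: int, hz: int) -> float:
    """Data rate in megabits per second for a stream of frames."""
    return bytes_per_frame * hz * 8 / 1e6

image_bytes = 1280 * 720 * 3   # one uncompressed 1280x720 RGB8 frame
cloud_bytes = 407_040 * 16     # one default-config D435 cloud, 16 B/point

print(f"image stream: {mbps(image_bytes, 30):.0f} Mbit/s")  # ~664 Mbit/s
print(f"cloud stream: {mbps(cloud_bytes, 30):.0f} Mbit/s")  # ~1563 Mbit/s
```

Even a single raw stream at these rates can exceed what a congested corporate wifi network realistically sustains, before any discovery or multicast traffic is counted.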
But your test is useful in getting started! You say to use the discovery server; is that enabled by default, so users can use it without having to know, care, understand, or otherwise worry about it to develop?
I’m more concerned with the on-going multicast traffic which is bogging down the entire system and creating havoc on people’s systems in practical wifi situations at steady state. Simple wifi situations aren’t good enough here: we’re not working on a theoretical robotics framework, we’re working on a production-ready one.
I can tell you, anonymized, from my inbox that Fast-RTPS does not meet the specifications of the mobile robotics community out of the box. This isn’t something I can speak to very intelligibly, but I’m sure @rotu can add more color to this analysis. I have no issues running Fast-RTPS locally, but even on my home wifi network, when trying to help debug issues from users, I’ve hit similar limitations.
The advantage of the default DDS discovery mechanism is that it works out of the box with no configuration, and it works in many situations, including wifi for simple cases. When you have a router that does not handle multicast properly (not all routers show this behavior), then I recommend using alternative mechanisms, such as providing a list of peers, or using a discovery server (also unicast). This will always require some configuration, as it depends on your network topology, but it is very easy to set up.
What you are describing, 50 mobile robots transmitting HD images and point clouds over a saturated wifi, is not what I would call a simple hello-world robotics application, and some fine tuning for problematic wifi routers could be necessary.
So, to recap, we have 3 options for discovery:
1.- Default DDS discovery mechanism using multicast (the default for all DDS implementations): No configuration required, works well in many cases, and the current behavior is optimized for lossy wifi networks.
2.- Default DDS discovery mechanism using unicast: Requires providing a list of unicast peers (IP addresses).
3.- Discovery server: Scales a lot better, as it does not require a conversation between every pair of peers as in the default discovery. Supports redundant discovery servers. Requires providing the IP address of the discovery server.
Cases 2 and 3 are set up through a very simple XML file. What would you recommend to simplify the configuration further? Do you think it is better to make the discovery server the default mechanism?
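For reference, and to make that “very simple XML file” concrete, a profile for case 2 looks roughly like the sketch below (based on the Fast RTPS profile schema; the profile name and peer address are placeholders, and exact element names may differ between versions):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of a participant profile with a unicast initial-peers list
     (case 2). 192.168.1.10 is a placeholder for a real peer's address. -->
<profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
  <participant profile_name="unicast_peers_profile" is_default_profile="true">
    <rtps>
      <builtin>
        <initialPeersList>
          <locator>
            <udpv4>
              <address>192.168.1.10</address>
            </udpv4>
          </locator>
        </initialPeersList>
      </builtin>
    </rtps>
  </participant>
</profiles>
```

Such a file is typically picked up by pointing the FASTRTPS_DEFAULT_PROFILES_FILE environment variable at it.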
The user traffic is unicast by default. If you have a lossy network shared by many nodes, it is possible to reach problematic situations well before you reach the theoretically available bandwidth.
With the study we are conducting, we will provide information about different cases, not only for the discovery phase, but also for the user traffic in wide deployments, giving you some general recommendations on how to proceed in such cases.
If you recommend using a discovery server why is that unable to work out of the box? I think you’re missing or ignoring the point. I’m asking a thematic question about whether we as a ROS community value good non-theoretically-ideal wifi out-of-the-box support even if it degrades slightly the performance in other aspects.
That sentiment is the problem I’m trying to solve. From a roboticist’s perspective, I find this an unacceptable solution. You cannot ask a new user to configure or optimize their network topology to get a simple ROS2 demo working with their robot. If I buy a Fetch robot from Fetch Robotics, stick it in my cubicle at work, build a ROS2 application, and can’t get an HD image stream and a couple of pointclouds from the robot to my computer in rviz, that’s a critically impaired process. I find it acceptable to need to optimize performance at the point where you are deploying in some massive industrialized setting. What you suggest there defines the following workflow for the “average user” I define above, with the skill sets I describe above:
Learn ROS2 API
Get some robot to play with
Build their custom application
Run it and fail
Immediately hit the wall with ROS, and then have to spend a week or more learning enough about DDS to know what dials to turn, or learning enough about networking to have a nuanced understanding of multicast, routers, unicast, discovery servers, etc.
Or they give up and spin their own UDP socket or regress back to ROS1, which is exactly what I would do in that situation.
Working reasonably well out of the box on corporate wifi in a steady-state setting is the minimum viable product of ROS2 that needs to be offered, in my opinion. This is what I’m referring to in the original post about treating wifi as a special case. This capability should be baked in on install, and if the solution a specific vendor implements doesn’t allow for that, perhaps the requirements that derived that solution should be reconsidered.
I’m describing a situation with 1 robot, with 50 nodes on 3 computers, which is a hello-world case in robotics; perhaps not in networking, but we’re not a networking community, we’re a robotics community. Navigation uses 19 nodes by itself. Add sensor drivers, hardware drivers, and user-space applications, and you hit 40-50 very quickly. There’s going to be a bunch of data flying around; HD images and 3D pointclouds are not exceptions to our needs, they’re the definition of them.
I really didn’t want to make this conversation pointed at one vendor or another, but this is simply not true. That may hold for your idealized network case, with a few nodes and a router sitting right next to them with no other traffic, but it is not the experience shared by multiple groups within Samsung, Rover, multiple groups in/around Intel, and myself. If we want to get granular, Cyclone’s SPDP option seems to fix this to a reasonable degree without any additional configuration / hardcoding of addresses. For them, my original post’s intent would just be having that enabled by default, or providing that in a default configuration file with ROS2 installs.
My goal still isn’t to talk about any particular vendor or any particular vendor’s solution to this problem, but in getting overall agreement that “yes, having good wifi support out of the box is something we find very important” or “no, having good wifi support isn’t worth it, we should instead provide accessible documentation”.
I’d be interested to hear @joespeed, @tfoote, and @rotu jump in with their thoughts. But really, I’d like to hear from anyone that has an opinion about whether this is a problem for their needs or not, to get some more outside input.
In Fast RTPS, multicast is used just for participant discovery (so an “SPDP option” is not required; EDP always uses unicast), and we have very defensive defaults for this kind of network. After Intel and others reported some problems over wifi with navigation, we reacted by modifying the discovery behaviour and changing the default parameters some months ago, so that is already in place.
With those settings (now the default), our performance is really good. See this video from @rotu:
This is the case you are talking about, I think.
When you need to improve this behaviour even further, because multicast is not an option on your network, then you should use the discovery server or unicast discovery.
The discovery server is included in Fast RTPS; any node can be a discovery server, and you do not need anything more. And we can look into easier configuration options. Right now, it is as simple as this:
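The snippet itself was not quoted in this thread. Purely as an illustration of the shape such a configuration takes, a client participant pointing at a discovery server looks roughly like this in eProsima’s discovery-server documentation (the server GUID prefix, address, and port below are placeholders, and the schema has changed across Fast RTPS/Fast DDS versions):

```xml
<!-- Sketch of a CLIENT participant pointing at a discovery server.
     The prefix, address, and port are placeholders, not real values. -->
<participant profile_name="discovery_client" is_default_profile="true">
  <rtps>
    <builtin>
      <discovery_config>
        <discoveryProtocol>CLIENT</discoveryProtocol>
        <discoveryServersList>
          <RemoteServer prefix="44.53.00.5f.45.50.52.4f.53.49.4d.41">
            <metatrafficUnicastLocatorList>
              <locator>
                <udpv4>
                  <address>192.168.1.10</address>
                  <port>11811</port>
                </udpv4>
              </locator>
            </metatrafficUnicastLocatorList>
          </RemoteServer>
        </discoveryServersList>
      </discovery_config>
    </builtin>
  </rtps>
</participant>
```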
Awesome. If other users in this thread come back and we have some agreement that better wifi support is something we want out of the box, let’s make that the default configuration for your DDS then. In the meantime, let’s move on from the specifics of your product and let users and others express their thoughts on the topic I actually addressed in my post.
This is not an affront to Fast RTPS, nor advocacy for Cyclone, nor really caring much at all about the implementation details under the hood of the ROS2 RMW. I’m looking at evidence-based conclusions from external users who tried to deploy mobile robotics applications using ROS2 and DDS, had issues, and contacted me while trying to get navigation stack stuff working. Clearly, there is some issue or disconnect. I don’t know what it is, and for the purposes of this discussion, it’s an implementation detail. The question at hand is more thematic: are we willing to make trade-offs to support this, and is this as big a problem as I feel it is?
I agree quite strongly with this. While any engineer knows that when you move to production you will need to spend time optimising and making things reliable, in a typical office using the office wifi is where prototyping most often begins, and that is when many foundational technology decisions are made, including “Should I use ROS 1, ROS 2, or my own solution?”.
It seems you already have some ideas – or at least have talked to some vendors about this @smac
Why don’t we make this more actionable and just list the options/approaches/settings that have worked/should work for the types of networks you have encountered difficulties with?
I don’t believe anyone here is going to say: “no, I don’t want wifi to work, I’d rather spend 3 weeks reading DDS documentation”.
So let’s get to it and start working on solutions.
So far, I’ve seen mentioned:
SPDP for Cyclone
Discovery server / unicast discovery for FastRTPS
Without listing these, there doesn’t seem to be a way to discuss whether they should be “the default”, as we’d be drawing a conclusion not based on technical facts, but based on a desire. And without knowing which options there are, we cannot determine whether they would also work sufficiently for non-wifi setups (as I would not know what to turn on/off).
There is a reason these are not the defaults in many DDS implementations, and it’s likely a good one. Whether it is a good one in the scenarios you’ve described remains to be seen.
I’m not the expert in this, so I will abstain. I just know it’s a problem that needs to be solved, from the experiences of many users trying to use out-of-the-box ROS2 in typical office/robotics environments. I want instead to build consensus that this is a problem, so we can build a requirement around it that we can then deliver against to find a technical solution. I don’t want to find a technical solution looking for a problem. I think that each vendor will have their own solution, but part of the reason some work better than others is that they designed their technical solutions in the absence of actual feedback from robotics users. I think starting from requirements is generally better practice. The fact that XYZ dials exist doesn’t mean we don’t actually need different dials.
I wouldn’t necessarily assume that. It may be true that they’re not the defaults in generic DDS implementations, but I don’t think there’s a particular reason they should remain the defaults for the ROS2 use-case of them.
Really, what I want to hear from are users of ROS2 if this is a problem they have. I want to build a general consensus before anything further. To me, this problem makes ROS2 DOA for mobile robotics users, which as @Katherine_Scott’s post shows is 50% of total ROS users. If I showed up off the street with the problems that exist from a clean ROS2 install, I’d immediately disregard ROS2 as an academic toy and move on.
@smac I 100% agree that the focus should be on good behavior by default, as well as anything else that reduces the barrier to entry.
@Jaime_Martin_Losa Thanks for sharing that bringup video. I learned a ton from making it, including the performance implications of wifi multicast (very bad, not just for the app that’s doing the multicasting, but for my poor coworkers trying to use the network normally!). Also, I’m really glad that ADLINK and eProsima were able to use the findings to improve their products across the board!
That Pi setup is beautiful! Where can I find the test setup that you use on it?
Yes, I, as a user, really believe this is a big problem.
I experienced multiple occasions where wifi was the biggest show stopper for ROS2.
University: should we switch teaching from ROS1 to ROS2? ROS1 is a big hassle regarding setting up the ROS_Master for some 20 students who all want to control their own turtlesim on a projection in class. The wifi network can mostly withstand that up to a point, but then it just crashes totally. This is more due to the local wifi setup and has nothing to do with ROS.
After doing some research on ROS2 and coming across various posts about wifi issues for even 1 robot, the decision to stay on ROS1 was made immediately. During such courses students mostly get a ready-made platform with some sensors to play with, and then do all the programming on their own laptops. The concept of remote login via ssh is mostly unknown to them. Therefore a stable wifi connection out of the box is strongly required.
ROS2 training from Fraunhofer IPA: roughly 10 people (50% academic, 50% industrial) participated. Everyone really enjoyed all the new features of ROS2 (including Navigation2) and liked playing around with their Gazebo instances with a turtlebot inside. After we started to move onto the real hardware at more or less the same time, the network crashed for everybody… I don’t think this was the fault of a poor setup. It’s just hard to teach ROS2 in a 3-day crash course and then take a huge DDS configuration into account as well… But sadly, after this, the general opinion among academic and industry participants was just like this:
It really saddens me, as I will nevertheless continue to push ROS2 in industrial environments. But those attending these courses were mostly highly skilled and needed to report to their management on whether ROS2 would be an option for them. Again, I don’t want to blame Fraunhofer IPA for this, as they did a really amazing job at preparing a really, really cool workshop. But this is exactly the thing @gavanderhoorn pointed out:
Industrial: Nothing particular about wifi comes to my mind right now. But I remember that at least 2 research teams I came across last year stuck with ROS1, as they were happier with the network stack for their particular use-cases.
Good performance over WIFI and other kinds of “lossy” networks has always been one of the major pain-points that ROS 2 is supposed to address, and the use of DDS has been at least partially motivated by the promise that it would enable this. Nobody explicitly said “without major configuration hassle”, but I think this went without saying, as people were used to that with ROS 1.
Therefore, I very much support @smac’s initiative here and think easy to use, high performance networking is something we as a community need out-of-the-box.
My personal revelation came at ROSCon 2019 during the iceoryx presentation, when the presenters showed how badly image transport works without iceoryx (not at all smooth). I mean, iceoryx is great, particularly once you really start increasing the throughput, but for a single image stream it should not be necessary, particularly not on localhost.
Since @gavanderhoorn has been asking for examples to test with, that would be my first suggestion. Now, I realize that “smooth image streaming” is not necessarily required for a robust and performant system, but it’s one of those “presentation” issues. People will notice it very rapidly when it’s not there. So, unless improving that reduces performance elsewhere, I think it’s a good test.
The other thing I noticed is that the navigation users seem to have more trouble than other people who may be using “simpler” setups that just stream a little bit of sensor data and a few commands (like many of my own simple test experiments).
Therefore, it may be that the number of nodes present is a limiting factor, at least as long as each node is mapped to a participant. This could also be translated into a test.
The last thing that comes to mind right now, even though it’s certainly not the least important, is that discovery seems to be a major issue, so we might want to look at performance after discovery and during discovery separately.
Thanks @smac for raising the visibility of this long-standing issue and to others for chiming in. Open Robotics is completely in agreement with your proposition that ROS 2 should work well out-the-box in common wifi environments, using defaults. If we can’t match ROS 1’s performance or ease-of-use in that setting, without special configuration from the user, then we’re doing something wrong.
As our team wraps up their work for the Foxy API freeze, we’re assigning some people to specifically investigate the wifi behavior problem over the next several weeks. We don’t know yet what the nature of a fix will be, but we’re hoping to see some material improvement in time for Foxy. We’ll be working closely with the vendors as we go, and we welcome help from all of you!
Exactly my sentiment, that sounds reasonable. Thanks for taking on the action item to build a requirement and figure out then what the solution is. I am chatting with colleagues in Korea about running some experiments and I’ll get back to you or post here with relevant results from their experiences. They have easier access to complex corporate wifi network environments than I have in my 1000 sqft apartment.