I agree quite strongly with this. While any engineer knows that when you move to production you will need to spend time optimising and making things reliable, in a typical office using the office wifi is where prototyping most often begins, and that is when many foundational technology decisions are made, including “Should I use ROS 1, ROS 2, or my own solution?”.
It seems you already have some ideas – or at least have talked to some vendors about this @smac
Why don’t we make this more actionable and just list the options/approaches/settings that have worked/should work for the types of networks you have encountered difficulties with?
I don’t believe anyone here is going to say: “no, I don’t want wifi to work, I’d rather spend 3 weeks reading DDS documentation”.
So let’s get to it and start working on solutions.
So far, I’ve seen mentioned:
- SPDP for Cyclone
- Discovery server / unicast discovery for FastRTPS
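To make the second option concrete, here is a minimal sketch of a Fast RTPS XML profile that replaces multicast discovery with a static list of unicast initial peers. The profile name, address, and port are placeholders for illustration, not taken from any real deployment:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
  <participant profile_name="unicast_discovery" is_default_profile="true">
    <rtps>
      <builtin>
        <!-- Announce directly to known peers instead of relying on
             multicast SPDP, which many wifi networks handle poorly -->
        <initialPeersList>
          <locator>
            <udpv4>
              <address>192.168.0.42</address>
            </udpv4>
          </locator>
        </initialPeersList>
      </builtin>
    </rtps>
  </participant>
</profiles>
```

The trade-off is exactly the one under discussion: this works on multicast-hostile networks, but it requires knowing peer addresses up front, which is the kind of configuration burden a good default should avoid.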
Without listing these, there doesn’t seem to be a way to discuss whether they should be “the default”, as we’d be drawing a conclusion not based on technical facts, but based on a desire. And without knowing which options there are, we cannot determine whether they would also work sufficiently for non-wifi setups (as I would not know what to turn on/off).
There is a reason these are not the defaults in many DDS implementations, and it’s likely it is a good one. Whether it is a good one in the scenarios you’ve described remains to be seen.
I’m not the expert in this, so I will abstain. I just know it’s a problem that needs to be solved, based on the experiences of many users trying to use out-of-the-box ROS2 in typical office/robotics environments. I would rather build consensus that this is a problem, turn that into a requirement, and then deliver a technical solution against it. I don’t want a technical solution looking for a problem. I think each vendor will have their own solution, but part of the reason some work better than others is that they designed their technical solutions in the absence of actual feedback from robotics users. Working from requirements is generally better practice. The fact that XYZ dials exist doesn’t mean they are the dials we actually need.
I wouldn’t necessarily assume that. It may be true that they’re not default from the generic DDS implementation, but I don’t think there’s a particular reason those are the defaults for the ROS2 use-case of them.
Really, what I want to hear from are users of ROS2 if this is a problem they have. I want to build a general consensus before anything further. To me, this problem makes ROS2 DOA for mobile robotics users, which as @Katherine_Scott’s post shows is 50% of total ROS users. If I showed up off the street with the problems that exist from a clean ROS2 install, I’d immediately disregard ROS2 as an academic toy and move on.
Just to clarify, “SPDP” is already the default behavior for Fast RTPS.
@smac I 100% agree that the focus should be on good behavior by default, as well as anything else that reduces the barrier to entry.
@Jaime_Martin_Losa Thanks for sharing that bringup video. I learned a ton from making it, including the performance implications of wifi multicast (very bad, not just for the app that’s doing the multicasting but for my poor coworkers trying to use the network normally!). Also, I’m really glad that ADLINK and eProsima were able to use the findings to improve their products across the board!
That Pi setup is beautiful! Where can I find the test setup that you use on it?
Putting my 2 cents in this:
Yes, I , as a user, really believe this is a big problem.
I experienced multiple occasions where wifi was the biggest show stopper for ROS2.
University: should we switch teaching from ROS1 to ROS2? ROS1 is a big hassle regarding ROS_Master setup for around 20 students who all want to control their own turtlesim on a projection in class. The wifi network can mostly withstand that up to a point, but then just crashes completely. This is more due to the local wifi setup than anything regarding ROS.
After doing some research on ROS2 and coming across various posts about wifi issues for even a single robot, the decision to stay on ROS1 was made immediately. During such courses students mostly get a ready-made platform with some sensors to play with and then do all the programming on their own laptops. The concept of remote login via ssh is mostly unknown to them. Therefore a stable wifi connection out of the box is strongly required.
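For context, the ROS1 classroom hassle being described boils down to every laptop exporting the right master and its own address; the IP addresses below are examples only:

```shell
# On the robot (runs the single shared master):
export ROS_MASTER_URI=http://192.168.0.42:11311
export ROS_IP=192.168.0.42
roscore &

# On each student laptop:
export ROS_MASTER_URI=http://192.168.0.42:11311   # point at the robot's master
export ROS_IP=192.168.0.57                        # this laptop's own address
rostopic list                                     # quick connectivity check
```

Multiply this by 20 students who each need their own master/robot pairing and it is easy to see why “no configuration at all” is the bar ROS2 is being measured against here.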
ROS2 training from Fraunhofer IPA: roughly 10 participants (50% academic, 50% industrial). Everyone really enjoyed all the new features of ROS2 (including Navigation2) and liked playing around with their Gazebo instances with a turtlebot inside. After we started to move onto the real hardware at more or less the same time, the network crashed for everybody… I don’t think this was the fault of a poor setup. It’s just hard to teach ROS2 in a 3-day crash course and take a huge DDS configuration into account as well… But sadly, after this, the general opinion among academic and industry participants alike was just this:
It really saddens me, as I will nevertheless continue to push ROS2 in industrial environments. But those attending these courses were mostly highly skilled and needed to report to their management on whether ROS2 would be an option for them. Again, I don’t want to blame Fraunhofer IPA for this, as they did an amazing job at preparing a really, really cool workshop. But this is exactly the thing @gavanderhoorn pointed out:
Industrial: Nothing particular about wifi comes to my mind right now. But I remember that at least 2 research teams I came across last year stuck with ROS1, as they were happier with its network stack for their particular use-cases.
Then I agree with you: for now, first build consensus, and then start looking at how to tackle this in the best way possible.
Good performance over WIFI and other kinds of “lossy” networks has always been one of the major pain-points that ROS 2 is supposed to address, and the use of DDS has been at least partially motivated by the promise that it would enable this. Nobody explicitly said “without major configuration hassle”, but I think this went without saying, as people were used to that with ROS 1.
Therefore, I very much support @smac’s initiative here and think easy to use, high performance networking is something we as a community need out-of-the-box.
My personal revelation came at ROSCon 2019 during the iceoryx presentation, when the presenters showed how badly image transport works without iceoryx (not at all smooth). I mean, iceoryx is great, particularly once you really start ramping up the throughput, but for a single image stream it should not be necessary, particularly not on localhost.
Since @gavanderhoorn has been asking for examples to test with, that would be my first suggestion. Now, I realize that “smooth image streaming” is not necessarily required for a robust and performant system, but it’s one of those “presentation” issues. People will notice it very rapidly when it’s not there. So, unless improving that reduces performance elsewhere, I think it’s a good test.
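To put rough numbers on why even a single image stream makes a meaningful test, here is a back-of-the-envelope bandwidth estimate for one uncompressed colour stream. The resolution and frame rate are assumptions for illustration, not figures from the iceoryx talk:

```python
# Rough bandwidth required by one uncompressed RGB image stream.
width, height, bytes_per_pixel = 640, 480, 3   # VGA, 8-bit RGB
fps = 30

frame_bytes = width * height * bytes_per_pixel
stream_mb_per_s = frame_bytes * fps / 1e6

print(f"{frame_bytes} bytes/frame")        # 921600 bytes/frame
print(f"{stream_mb_per_s:.1f} MB/s")       # 27.6 MB/s
```

At nearly 28 MB/s, per-subscriber serialization and copying add up quickly even on localhost, which is exactly the overhead that zero-copy transports like iceoryx are designed to remove.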
The other thing I noticed is that the navigation users seem to have more trouble than other people who may be using “simpler” setups that just stream a little bit of sensor data and a few commands (like many of my own simple test experiments).
Therefore, it may be that the number of nodes present is a limiting factor, at least as long as each node is mapped to a participant. This could also be translated into a test.
The last thing that comes to mind right now, though certainly not the least important: it seems as if discovery is a major issue, so we might want to look at performance during discovery and after discovery separately.
Thanks @smac for raising the visibility of this long-standing issue, and to others for chiming in. Open Robotics is completely in agreement with your proposition that ROS 2 should work well out of the box in common wifi environments, using defaults. If we can’t match ROS 1’s performance or ease-of-use in that setting, without special configuration from the user, then we’re doing something wrong.
As our team wraps up their work for the Foxy API freeze, we’re assigning some people to specifically investigate the wifi behavior problem over the next several weeks. We don’t know yet what the nature of a fix will be, but we’re hoping to see some material improvement in time for Foxy. We’ll be working closely with the vendors as we go, and we welcome help from all of you!
Exactly my sentiment, that sounds reasonable. Thanks for taking on the action item to build a requirement and figure out then what the solution is. I am chatting with colleagues in Korea about running some experiments and I’ll get back to you or post here with relevant results from their experiences. They have easier access to complex corporate wifi network environments than I have in my 1000 sqft apartment.
To be clear, I don’t want to give the impression that “we’ve got this; everybody else can just wait for the fix to be released.” While we are investing time and effort in the issue, we still want help from everyone who can contribute. I’ll defer to @dirk-thomas to link to the relevant ticket(s).
Seems like a good time to start the middleware working group to take up lower level issues such as these. It’s very likely that this type of issue has been discussed in OMG meetings which means those members would have the best insights into how to find the right balance.
Just a couple of cents more. You can expect a lot better experience out of the box in the Foxy release, as things have improved a lot in the meantime.
1.- Discovery over WIFI (@smac): A lot of improvements have been made and both the behavior and the defaults now are specifically tuned for this scenario.
Also, as I promised, here is the first article of a series of intensive experiments with the Raspberry Pi Farm:
This is available in Dashing, and optimized in Foxy. As expected, it works very well for wifi networks that do not support multicast. We are studying how to simplify the required configuration even further, but it could be as simple as specifying the master in ROS1: if you export an environment variable, we could select that discovery mechanism, for example.
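As a sketch of what that environment-variable approach could look like, modeled on Fast DDS’s discovery-server tooling (the exact CLI flags and variable name were still prospective at the time of this thread, so treat them as assumptions; the address is an example):

```shell
# On one well-known machine, run a discovery server
# (server id 0, listening address, and port are examples):
fastdds discovery -i 0 -l 192.168.0.42 -p 11811

# On every other machine, point participants at the server
# instead of relying on multicast discovery:
export ROS_DISCOVERY_SERVER=192.168.0.42:11811
ros2 run demo_nodes_cpp talker
```

This mirrors the ROS1 workflow of exporting a single master address, which is presumably why it is being proposed as the low-friction option here.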
2.- Streaming and big data in the intra-process and inter-process cases (@Ingo_Lutkebohle): This week we have released Fast RTPS 1.10 (already available), with a complete shared memory transport and optimized intra-process behavior. This is the first time this feature is available in an open-source implementation of DDS, and as expected it decreases latency and increases throughput a lot for big messages, such as those used in video streaming.
We are now in the process of characterizing the performance, and we will publish some results soon. As an example, for a 2 MB message the latency decreases around 20 times in the inter-process case.
This feature is enabled by default, so no configuration is required, and it is available on all the ROS 2 supported platforms.
3.- Benchmarking as part of our CI: We at eProsima are continuously making a big effort to improve performance and scalability, with a dedicated team for performance and benchmarking tests as part of our CI. We usually add more scenarios and tests when customers or the community describe a performance issue, and we will stay tuned to keep improving.
Just to be clear: this version of Fast-RTPS is currently not being used by any ROS distro - not even master which will become Foxy.
Just to update this thread: Now shared memory transport will be available for Foxy.
Also, we have an updated discovery study here.
@Jaime_Martin_Losa thanks for the update!
btw, your latest documentation doesn’t seem to have the use-cases link anymore! So the links in this thread to the use-case documentation don’t work.
I had a look at the study and one thing struck me: you have 29 participants, with 10 endpoints each, which is a small network by robotics standards. And discovery traffic for this network, even in the very best case, causes ~40,000 packets to be exchanged, and close to 90,000 packets for what’s currently the default case (serverless discovery).
No wonder the wifi breaks down. Even without multicast issues, that’s a lot of packets.
This really strikes me as ridiculous. Sure, if the goal were a fully meshed network, with every participant talking to all the other participants, fine. That would be a lot of connections, and TCP would do worse setting it all up. But this is not what is usually happening. Most of our systems are very sparsely connected. The large majority of endpoints are only ever accessed by exactly one other participant. There are a few exceptions, of course (/tf comes to mind), but I would still say that the average is somewhere between 1 and 2.
I mean, ROS 2 fires up ~15 services per node just for parameters and the lifecycle! In most cases, nobody except launch ever accesses those. ROS 1 took advantage of this: nodes only asked the roscore for the topics and services that they actually needed.
There must be a way to take advantage of this fact to reduce discovery traffic for DDS as well.
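The quadratic blow-up is easy to illustrate with a toy model: assume every participant must exchange an announcement and an acknowledgement with every other participant for each endpoint it carries. The constant factors here are made up; only the N²·E scaling matters:

```python
# Toy model of fully-meshed DDS endpoint discovery traffic:
# every ordered pair of participants exchanges messages for each
# endpoint, so traffic grows roughly as N^2 * E.
def discovery_messages(n_participants: int, endpoints_each: int,
                       msgs_per_endpoint_pair: int = 2) -> int:
    ordered_pairs = n_participants * (n_participants - 1)
    return ordered_pairs * endpoints_each * msgs_per_endpoint_pair

print(discovery_messages(10, 10))   # 1800
print(discovery_messages(29, 10))   # 16240
```

Even this optimistic toy count sits in the tens of thousands for the 29-participant study scenario (real SPDP/SEDP repeats announcements, so the measured ~90,000 packets is plausible), and sparse, whitelist-style discovery attacks exactly the N² term.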
+1, and I think we covered that somewhat in our discussion yesterday. We’re also seeing lots of problems with services just on localhost, with dropped service calls (up to 50%) even without wifi, potentially due to this.
+1 on this.
From an analysis that we did last year, this fully connected discovery was also a major cause of RAM overhead.
We did some proof-of-concepts with fastrtps where we applied a whitelist: i.e. discovery was allowed only between entities that matched in the whitelist.
Considering a 10 nodes system, RAM was reduced to almost 2/3.
I expect these numbers to not be accurate anymore as now with 1 participant per process the impact should be smaller.
However, given the advantages from many points of view (better discovery and performance), maybe it is worth investigating something similar again?
@Jaime_Martin_Losa IIRC you had a branch with this feature, did this ever get into master?
The latest doc also contains the typical use cases here:
We have improved the documentation a lot for the latest release.