I’m really excited to announce our next blog post, featuring our OpenRover running ROS 2 with the Navigation stack! We compared different DDS implementations in a typical deployment over wifi, and recorded some fancy videos.
I think the scenario you are testing is really complex, and it is hard to extract any conclusions regarding performance.
A couple of months ago you were recommending a switch to OpenSplice, yet this time OpenSplice often shows start-up times of more than 10 minutes, according to your blog post.
In such a complex system, with components that are still under development and have problems with any DDS implementation (see for example https://github.com/ros2/rviz/issues/437), I think the best approach is to try to isolate the issues and understand the reasons behind the numbers and hangs you are getting.
From the symptoms you describe, it seems you are probably losing discovery multicast announcements. Multicast is not very reliable on some wifi routers, and keep in mind that the discovery announcement period is 30 seconds by default in most implementations, so a few lost announcements could lead to long discovery times. Could this issue also produce timeouts in your scenario? You said that just restarting RViz solves the hang, which could indicate that this is the case.
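To put rough numbers on that, here is a minimal back-of-the-envelope sketch. It assumes one announcement is sent at t = 0 and then one every period, and that each announcement is lost independently with the same probability. Both are simplifications: real implementations may send extra announcements at start-up, and wifi losses tend to come in bursts. The function names are made up for illustration.

```python
import math


def expected_discovery_delay(period_s: float, loss_prob: float) -> float:
    """Mean wait until the first announcement gets through.

    Assumes an announcement at t = 0 and one every `period_s` seconds after,
    each lost independently with probability `loss_prob` (a simplification).
    """
    if not 0.0 <= loss_prob < 1.0:
        raise ValueError("loss_prob must be in [0, 1)")
    # The number of lost announcements before the first success is geometric
    # with mean p / (1 - p); each lost one costs a full period.
    return period_s * loss_prob / (1.0 - loss_prob)


def prob_undiscovered_after(t_s: float, period_s: float, loss_prob: float) -> float:
    """Probability that no announcement has been received by time `t_s`."""
    sent = math.floor(t_s / period_s) + 1  # announcements sent in [0, t_s]
    return loss_prob ** sent


if __name__ == "__main__":
    for p in (0.1, 0.5, 0.8):
        print(p, expected_discovery_delay(30.0, p), prob_undiscovered_after(60.0, 30.0, p))
```

Under these assumptions, with a 30-second period and half of the announcements lost, the mean extra delay is already 30 seconds, and one participant in eight is still undiscovered after a full minute.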
In July, we contacted you through email offering help to debug these problems, and here I extend the invitation. Let’s have a meeting to discuss how to isolate/reproduce the issues.
It’s absolutely complex! OpenSplice Community 6.9 works great, and more reliably than Fast RTPS with a small number of nodes (at least at the time that post was authored). But introducing the ROS2 Navigation stack makes OSPL frustratingly slow and unreliable.
Here’s the issue writeup: rmw_opensplice#282. Some attempted fixes are discussed in the GitHub issue, namely changing AllowMulticast, enabling AggressiveKeepLastWhc, and tuning high-water mark settings. Restarting RViz sometimes resolves bringup issues in Fast RTPS, but not in OpenSplice. After some analysis of Wireshark dumps, @eboasson concluded that tuning probably won’t fix the issue. I welcome and will try any tuning suggestions - just shoot over an ospl.xml file. Better yet: submit a PR to the default config so it works better out of the box.
It does seem that the issues go away in OpenSplice 6.10.2p4, even with shared memory disabled, but that (1) is unavailable to install via apt or rosdep and (2) requires a commercial license. I’m hoping that a future release of OpenSplice Community provides a more usable foundation for ROS 2.
At this point we have 2 engineers from our core team working on characterizing the performance of multicast traffic on wifi networks, so we should have more information very soon. I will keep you posted.
If a shared memory transport could help in this case, we have good news: we are already developing one, and the goal is to provide an initial version in our next release (open source; we have no commercial edition).
Good to hear! I’ll keep my recommendations up to date when that’s out.
What do you mean you have no commercial edition? What is ADLINK Vortex OpenSplice 6.10.2p4 if not a commercial variant of Vortex OpenSplice Community Edition?
Oops! I misunderstood and thought you were with ADLINK! I’m sorry for the confusion. I can’t find any emails from eProsima but I’ll be very happy to give any diagnostic info and try out any suggestions for tuning Fast RTPS. Feel free to email me at dan@digilabs.io.
I dropped the ball on responding to your email on July 31st. I apologize for that. Glad you and Dan could connect here. Dan is our local expert on DDS stuff, so he is the person to be in contact with. I’d love to see all the DDS implementations in the sub-10-second bring-up group. Let’s see if we can make that happen.
Good news: we have been working with RoverRobotics, and our bring-up time is now around 10-15 seconds, which is probably the minimum possible given the timing of their system.
It seems some routers can cause problems when sending and listening to multicast through different interfaces (for example wifi and cable). See this issue for a full explanation:
We provided a fix as a workaround. It is already in Dashing if you build from source, and it will be included in the next Dashing sync very soon.
@ruffsl could you check if this solves your issue too?
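For anyone who wants to poke at the multi-interface behaviour by hand, here is a minimal sketch that joins the default RTPS discovery multicast group on one specific interface using plain Python sockets. This is not the Fast RTPS fix itself, only an illustration that multicast membership is requested per interface. The group address and port formula are the RTPS defaults; the interface addresses in the usage example are hypothetical.

```python
import socket

SPDP_GROUP = "239.255.0.1"  # default RTPS discovery multicast group


def spdp_multicast_port(domain_id: int) -> int:
    """Default RTPS discovery multicast port: PB + DG * domain_id + d0."""
    return 7400 + 250 * domain_id


def membership_request(group: str, interface_ip: str) -> bytes:
    """Build the ip_mreq payload: group address followed by the local interface address."""
    return socket.inet_aton(group) + socket.inet_aton(interface_ip)


def open_listener(interface_ip: str, domain_id: int = 0) -> socket.socket:
    """Open a UDP socket that joins the discovery group on one interface only."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", spdp_multicast_port(domain_id)))
    sock.setsockopt(
        socket.IPPROTO_IP,
        socket.IP_ADD_MEMBERSHIP,
        membership_request(SPDP_GROUP, interface_ip),
    )
    return sock
```

Calling `open_listener("192.168.1.10")` for the wifi address and `open_listener("10.0.0.5")` for the wired one (both addresses made up here), then comparing what each socket receives with `recvfrom`, is one way to see whether announcements arrive on only one of the two interfaces.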
Please note that we’ve backported the relevant DDSI enhancements from our commercial 6.10 release (regarding retransmission of fragments) to our 6.9 community edition, and we are awaiting feedback on whether they resolve the issues here. Once that is confirmed, we’ll also update the pre-built binaries on the community website.