FastDDS without Discovery Server?

I’ve been struggling with FastDDS (on ROS2 Humble) in several ways - but the biggest issue has been discovery.

With my robot I have one launch file for the drivers, a second for localization, a third for navigation - all of them run on the robot computer. If I start them all up in series, things work. But say I have to restart navigation to change a configuration: about 75% of the time it fails to connect certain topics (tf especially seems to be an issue), and then I have to restart the drivers launch file. I will note that I think the issue is discovery, since existing connections (e.g., localization nodes to drivers) continue to function. Switching to CycloneDDS also makes the problem go away.

As I started to research solutions to this - everything seems to suggest the “solution” is the Discovery Server:

Is it really the case that one of the more touted “features” for ROS2, the lack of a single point of failure (rosmaster), doesn’t really work with larger systems and the default DDS?

Or is it that the out of the box configuration doesn’t work and I need to configure FastDDS in some particular way?

Looking for feedback/help here - since according to the Technical Evaluation Reports half of respondents say they like FastDDS over CycloneDDS - but I haven’t been able to get things to work well.

Thanks,
-Fergs

P.S. I know support is supposed to go towards answers.ros.org - but I feel like there is no “single answer” here - I’m really hoping this turns into a discussion of how people are actually getting FastDDS to work in larger systems.

P.P.S. If you are using the Discovery Server in a larger system because it is the right solution, I’d also be interested to hear that.

Even if that’s the answer, that should be handled by the default profiles and not left to the user. What you’re trying to do is pretty fundamental/basic - I’m shocked that there’s a problem with Fast-DDS in that.

I use Cyclone for most things involving hardware robots. I’ve found anecdotally that it’s more stable in bringup / regular service calls, but it’s been a while since I seriously looked at Fast-DDS. Now that it’s the default, I’ll start to dogfood it more because I need to support Nav2 users, but this isn’t a great first impression for the mobile robotics community.

I’m also now starting to see a ramp-up of RMW/DDS-related tickets in Nav2, which I haven’t seen since the Foxy days, the last time Fast-DDS was the default. It is troubling that we’ve potentially regressed in out-of-the-box behavior, even if the TSC RMW report metrics didn’t capture that information. It’s also worth noting an RMW issue reported with Fast-DDS: “Subscriber which is created with dedicated callback group at runtime not working” (Issue #613, ros2/rmw_fastrtps on GitHub), which is a pretty critical regression for my particular corner of the world.

Very interesting timing, guys.

The team behind Fast DDS is pretty easy to reach if you have a reproducible problem. Why didn’t you post an issue? I don’t see any real questions here. Many users do meet with us and discuss their architectures. Why didn’t you try that channel? GitHub? Call us?

With Foxy, Galactic, and now Humble, we have around 50K clones per month of our main Fast DDS repo. Also, our commercial customers are growing at a steady rate. We have many users, and most of them, either using ROS 2 or DDS directly, are happy users.

Unfortunately, in some cases we can have a bug, or other implementations could perform better for specific use cases, because in middleware all decisions have a trade-off, and surely we are not the best for all potential use cases.

Currently, the process to select the RMW default is a transparent one: Technical Reports and a vote by the members of the TSC, all of them heavy users of ROS 2 and big contributors. @smac thinks Cyclone is better, and he probably voted for Cyclone. But we also have very heavy users of ROS 2 in favor of Fast DDS, using ROS 2 and Fast DDS in production systems and commercial robots, and we won the vote. And I like that: votes based on reproducible technical arguments and unbiased measurements, and democracy.

My team is very open to discussing any discovery scenarios. We even have a really large study on scalability you can check here:

Fast DDS Discovery Mechanisms Analysis

If you want solutions, discuss your scenario technically, have a meeting with us, etc. It is easy to find us. Just drop me a message. Or if you want a public discussion, a reproducible issue is a good starting point. We are always willing to improve our implementation.

I think I’m pretty clear that as Fast-DDS is now the default, I want to make sure any odds and ends are handled so that Nav2 and mobile robotics users in ROS 2 get a good experience, as I should hope everyone would be on the same page about.

No one’s suggesting any radical measures. This is a forum for discussing potential issues and gathering feedback; that’s all that’s happening here.

Hi @smac,

I think it would be good to be really open about how the default is chosen (the technical reports, benchmarks, etc.), and to be as transparent as possible on this, rather than basing the selection on political/commercial arguments, or on who shouts more.

Many heavy ROS 2 users have strong technical opinions, and I think that is a good thing. I respect yours, and seriously, I would like to have you on board and happy with us. We are taking care of the issues you are mentioning in your post.

It is also important to note that ROS 2 Humble was just released. Big changes, new features, new default… so give it some weeks to adjust: the first sync/patch was 3 days ago. Humble, like Foxy, is an LTS release, and for us it is a key priority to solve any outstanding issues, as it is going to be around for some years…

I kinda thought that was what I was doing here - given that I’m not using this in some big fancy, commercial product, my expectation is that I would be relying on community support. I wasn’t even sure where I would even post an issue (the rmw implementation, FastDDS itself?). Given that this is more of a “systems” issue, I don’t have like a minimal reproducible example to share.

My question here really is: I’m looking for insight into how to get a larger system (25+ nodes) working since I’m not having much luck. I’m not even entirely sure where to start. I was hoping some of your many users might be able to point at a resource/post that says “hey, this is how we made things really solid” - as so far everything seems to point to the Discovery Server (which seems counterintuitive given the marketing around ROS2 the last few years, but maybe that is the right answer and I should just go use that).

My linking to the RMW report was more that it told me that half of users out there ARE getting this to work very reliably - I want to know their tricks/tips.

Hi Mike

25+ nodes is not a really big system. We have production deployments way larger, some of them using the default discovery, and some of them using the discovery server.

I am writing you a private message proposing some slots for a meeting to better understand your system and, if possible, try to reproduce the problem. Sounds like a plan?

@Jaime_Martin_Losa, I think that @mikeferguson and @smac have expressed their opinions in a very constructive way and your response is a little defensive.

I directly know a company that is struggling with the very same problem (changing DDS vendor did not help them, though).

FastDDS has proven to have many desirable qualities and it was selected for that reason. Now that we acknowledged that (again) shall we focus on this very practical (and serious) concern?

ROS has a great community and very often problems (and solutions) are discussed openly. There is nothing wrong with that, as long as it is done politely.

Saying that the problem doesn’t exist is not the best strategy, in my opinion. Maybe the best solution is as simple as improving the documentation or making existing information more visible.

That said… I am not interested in personally joining this debate any further, until I start using Humble myself :smile:

Hi @facontidavide

A Publish-Subscribe architecture with no brokers comes with pros and cons. The pros are better latency and throughput, because you don’t need to route the information through a broker, and automatic plug-and-play discovery.

But P2P discovery means the required number of messages is of the order of the square of the number of peers, so it comes at a price when you have a lot of nodes. You can mitigate the effect with a good implementation of the process, but when you have a lot of nodes (25 is not a lot), the discovery could be very chatty. We don’t deny that, and that is why for this and other situations we have the discovery server.
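
To put rough numbers on the quadratic growth mentioned above, here is a minimal sketch. It only counts who has to learn about whom, not actual RTPS packets, and it assumes every participant must discover every other participant in the P2P case, while in the server case each participant only talks to the server(s). The function names are made up for illustration.

```python
def p2p_discovery_pairs(participants: int) -> int:
    """Unordered pairs of participants that must discover each other (P2P)."""
    return participants * (participants - 1) // 2


def server_discovery_links(participants: int, servers: int = 1) -> int:
    """Client-to-server links when discovery goes through central server(s)."""
    return participants * servers


if __name__ == "__main__":
    # Compare the two growth rates for a few system sizes.
    for n in (5, 25, 100, 500):
        print(f"{n:>4} participants: "
              f"{p2p_discovery_pairs(n):>7} P2P pairs, "
              f"{server_discovery_links(n):>4} server links")
```

Note that what counts here is DDS participants rather than ROS nodes, so a process hosting many composed nodes may count as a single peer.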

The discovery server is based on the same building topics of the DDS standard, and it does not add any complexities. Moreover, we are talking at the OMG about how to incorporate this mechanism into the DDS/RTPS specs. It is not a Single Point of Failure, as you can have several redundant discovery servers.

Yes, discovery problems in large systems can happen: it is not a particular implementation, it is not DDS, it is the very nature of the architecture. And then, you need to fine-tune your system, use tools such as the Discovery Server, or just talk to us.
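
For anyone who wants to try the redundant-server setup described above, here is a minimal sketch. It assumes the `ROS_DISCOVERY_SERVER` environment variable that ROS 2 reads when using Fast DDS, which takes a semicolon-separated list of `ip:port` entries. The addresses and the helper name `discovery_server_env` are hypothetical, so check the Fast DDS documentation for your release before relying on this.

```python
import os


def discovery_server_env(servers: list[tuple[str, int]]) -> str:
    """Build a semicolon-separated 'ip:port' list for ROS_DISCOVERY_SERVER."""
    return ";".join(f"{ip}:{port}" for ip, port in servers)


if __name__ == "__main__":
    # Hypothetical addresses: two servers so that neither is a single point of failure.
    value = discovery_server_env([("192.168.1.10", 11811), ("192.168.1.11", 11811)])
    os.environ["ROS_DISCOVERY_SERVER"] = value  # inherited by processes launched from here
    print(f"ROS_DISCOVERY_SERVER={value}")
```

The variable has to be set in the environment of every process that should use the servers, so in practice it usually lives in the shell profile or the launch environment rather than in a script like this.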

I want Fast-DDS to be successful as I want anything that ships with ROS to be successful. When issues are brought up, they’re not critiques or reasons not to use Fast-DDS; they’re issues that need to be dealt with so that users can use Fast-DDS effectively. At the moment, at least in my niche, Fast-DDS has some serious issues, but they are problems I assume can be solved and I have full faith will be solved. I’m simply raising them so that there is awareness and they can be solved.

If anything, now more than ever I want Fast-DDS to be successful and work through these issues because it’s the default face that users will see out-of-the-box (and I’m sure from our discussions & comments above you recall how much I value that). Just as I’ve been asked in the past to specifically call out other DDS vendors in my documentation as “use this” or “officially supported”, I’ve refused because I’m not pro-DDS_A or anti-DDS_B. I’m pro-things-just-working so I can focus on making the best navigation system around :slight_smile:

But regardless, is this not a channel like any others (GitHub tickets, emails, messenger pigeons, and Discourse)? I think this is as fair of a place for discussion as any.

My personal experience, speaking from a company with few resources to spend and little expertise on DDS configuration:
We started our work in ROS2 Galactic using Fast DDS. At the time we had very frustrating issues with services not responding, not being discovered, or answering with a huge latency, plus some high CPU usage (all of that was reported). And it was NOT a distributed system: everything was running on a powerful x86 computer under Ubuntu 20.04 (cannot be more standard). Nothing too fancy: customized nav2 stack, 3D pointcloud processing pipeline using composition to avoid serialization, and hardware interfaces.
All our problems magically disappeared when switching to Cyclone DDS.
In retrospect, the highest cost of switching from ROS1 to ROS2 for us was tackling DDS-related issues (another example: localhost-only mode needs multicast enabled on the loopback interface to work, and the way to activate it, i.e. “ip link set lo multicast on”, was hard to find).
From what I read in this post, it seems that there are still some instabilities in the core ROS functionalities (publish/subscribe/topic/service/action) linked with the default DDS vendor change. In that sense (node-to-node communication on a single computer), ROS2 is still not on par with ROS1. In ROS1 it just works.
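
Since the loopback multicast setting mentioned above is easy to miss, here is a minimal sketch for checking it on Linux. It assumes the interface flags are exposed as a hex string in `/sys/class/net/<iface>/flags` and that the multicast bit is `IFF_MULTICAST = 0x1000` from the Linux headers. The helper names are made up for illustration.

```python
from pathlib import Path

IFF_MULTICAST = 0x1000  # Linux interface flag bit (assumed from <linux/if.h>)


def flags_have_multicast(flags_text: str) -> bool:
    """Return True if a sysfs flags string such as '0x1009' has the multicast bit set."""
    return bool(int(flags_text.strip(), 16) & IFF_MULTICAST)


def loopback_multicast_enabled(iface: str = "lo") -> bool:
    """Read the interface flags from sysfs and test the multicast bit."""
    return flags_have_multicast(Path(f"/sys/class/net/{iface}/flags").read_text())


if __name__ == "__main__":
    try:
        if loopback_multicast_enabled():
            print("multicast is enabled on lo")
        else:
            print("multicast is disabled on lo; try: sudo ip link set lo multicast on")
    except OSError as err:
        # Not Linux, or sysfs is not available in this environment.
        print(f"could not read interface flags: {err}")
```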

My feedback as a “naive” ROS user: I believe there are some quality system tests missing that target DDS functionalities. It is not normal that the service issues we experienced under Galactic went unnoticed. The same goes for the current issues affecting subscriptions with callback groups under Humble. If it is a matter of DDS configuration, then the default configuration should work out of the box, at least on a standard system (x86 Ubuntu LTS) for the standard ROS interfaces (publish/subscribe/topic/service/action). I don’t think it is acceptable to wait for a non-rolling release to test and iterate on these issues. I can totally understand the need to dig into the configuration for exotic use cases, but let’s remember that the vast majority of users run ROS on a single computer without having to worry about network latency, QoS and packet loss between nodes.
I am willing to help describe basic test use cases if needed.

Our system-level tests, where we test out various combinations of topics, services, etc. across nodes, live at ros2/system_tests on GitHub. If you’d like to open issues (or even better, pull requests) for additional tests, we’d be happy to review them.

Similarly, not once did we have issues with discovery or service calls on CycloneDDS on galactic. Everything rmw related just worked automagically. But as shared memory support in FastDDS is more mature and more out of the box, we have turned to FastDDS for performance gains. However, we have encountered not just one, but many issues with basic behavior in FastDDS in a very short amount of time after migrating, first on Galactic, and then on Humble. Examples:

EDIT: I have also not had much success with shared memory support in CycloneDDS via iceoryx. But it should be noted that some (but not all) of these FastDDS issues are not even shared memory specific (which is enabled by default, but can be disabled).

I’m out of town this week, so not much progress on resolving my issues - but I wanted to post a quick update on what has happened since I initially posted:

  • I had a call with eProsima and they suggested a few debugging tips:
      • They noted that there is a “keep alive” timeout, so I am avoiding shutting down nodes and then restarting them immediately (waiting at least a minute to make sure the connections have timed out).
      • They noted that there is a “fastdds shm clean” command which can clear the shared memory in case something is left over. This did not help with my issue.
      • If you’re using the FastDDS monitor tool, it will only show one participant per process, so the number of participants will differ from the number of nodes that “ros2 node list” displays.
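
The “wait before restarting” tip above could be scripted along these lines. This is only a sketch: the 60-second grace period is my reading of “at least a minute” and not a documented Fast DDS value, and the helper names are hypothetical.

```python
import time

KEEP_ALIVE_GRACE_S = 60.0  # assumption: "at least a minute", not a documented timeout


def remaining_wait(stopped_at: float, now: float, grace: float = KEEP_ALIVE_GRACE_S) -> float:
    """Seconds still to wait before a restart, given when the nodes were stopped."""
    return max(0.0, grace - (now - stopped_at))


def wait_before_restart(stopped_at: float, grace: float = KEEP_ALIVE_GRACE_S) -> None:
    """Sleep until the grace period since shutdown has elapsed (monotonic clock)."""
    delay = remaining_wait(stopped_at, time.monotonic(), grace)
    if delay > 0:
        time.sleep(delay)
```

A wrapper script would record `time.monotonic()` when it stops a launch file and call `wait_before_restart` with that value before starting it again.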

With some additional testing, I found an interesting issue:

  • I had been running “ros2 node list” to see if things were out there - and it doesn’t seem to report all the nodes (which was why I assumed this was a discovery issue)
  • HOWEVER, this time around I decided to test running “ros2 topic echo /tf” even when “node list” doesn’t return everything, and that WORKED - I was definitely getting all three sets of TF data I would expect (map->odom, odom->base, everything else from robot_state_publisher).
  • So… it appears this might actually be something slightly different than a pure discovery issue (since the CLI tools can connect but not navigation).

I don’t know if it is directly related, but when I shut down and restart nodes many times, sometimes “ros2 topic list” fails to print all of the available topics, while “ros2 topic list --no-daemon” always shows all the topics. If that is the case for you, maybe we should consider adding a “--no-daemon” option to “ros2 node list”.
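
To check whether the daemon is the one missing topics, the two outputs can be compared with something like the sketch below. It assumes `ros2` is on the PATH and only uses the two invocations mentioned above. The helper names are made up for illustration.

```python
import subprocess


def parse_names(cli_output: str) -> set[str]:
    """Turn CLI output with one name per line into a set, ignoring blank lines."""
    return {line.strip() for line in cli_output.splitlines() if line.strip()}


def missing_from_daemon(daemon_output: str, direct_output: str) -> list[str]:
    """Names seen without the daemon but absent from the daemon's answer."""
    return sorted(parse_names(direct_output) - parse_names(daemon_output))


def run_topic_list(no_daemon: bool) -> str:
    """Run 'ros2 topic list', optionally bypassing the daemon."""
    cmd = ["ros2", "topic", "list"]
    if no_daemon:
        cmd.append("--no-daemon")
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


if __name__ == "__main__":
    try:
        stale = missing_from_daemon(run_topic_list(False), run_topic_list(True))
        print("topics the daemon does not know about:", stale or "none")
    except (FileNotFoundError, subprocess.CalledProcessError) as err:
        print(f"could not run ros2: {err}")
```

If the list is non-empty, the daemon's cached graph is probably stale rather than discovery itself being broken.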

I think this topic is too important to be closed after 30 days from the last comment :slight_smile:.

@mikeferguson do you happen to have any follow-up thoughts on these issues?