Fast DDS features an alternative to the DDS Simple Discovery Protocol (SDP) mechanism: Discovery Server.
Discovery Server is a mechanism designed for large deployments with many nodes. It reduces the discovery-related network traffic while avoiding typical issues of heterogeneous networks, such as packet loss over WiFi or multicast drops in network equipment.
eProsima Fast DDS v2.0.2 further reduces the discovery related traffic when using the Discovery Server by only connecting those Clients which have something to say to each other.
How does it do it? By replacing the standard peer-to-peer discovery with one or several Discovery Servers to which the Clients connect.
This strategy has reduced the network traffic by up to 93% when compared to the standard SDP, and by up to 83% when compared to the previous implementation of the Discovery Server.
That sounds awesome! Anything dropping by 93% is a major accomplishment, congrats! While I totally recognize that less network traffic means things are improving substantially, can you provide some metrics on how this might affect an end-user’s ROS 2 application? E.g. what change will a typical ROS 2 user see when using this change and Fast-DDS (less CPU, less latency, fewer dropped messages, etc.)? Having a “bottom line” understanding might help people make system-level decisions based on the new information.
Is this or should this be default enabled? Not a loaded question, just curious what the state of that discussion is.
Switching from the DDS (and ROS 2) default Simple Discovery Protocol (SDP) to eProsima’s Discovery Server (DS) will only have an effect on the discovery-related traffic. Briefly, discovery comprises two phases: Participant Discovery Phase (PDP), and Endpoint Discovery Phase (EDP). Translated into ROS 2 Foxy, PDP is the phase where contexts discover each other (since Foxy, the context is the one holding the DDS participant; before that, it was held by the node), whereas EDP is where the endpoints (ROS 2 publications and subscriptions) discover each other.
SDP entails so much network traffic because, in order to have dynamic discovery, PDP utilizes multicast, which means that all the participants (contexts) in the same domain discover each other at PDP level. Once PDP is finished, the participants (contexts) exchange information about all the endpoints (publications/subscriptions) they have. This is a reliable communication built on top of UDP, which means that the EDP traffic is not only the endpoints’ information itself, but also meta-traffic for the reliability mechanism (heartbeats, acknacks, etc.). Each ROS 2 node features 8-10 (not sure about the concrete number) builtin topics, which means 8-10 publications and 8-10 subscriptions, i.e. 16-20 endpoints in addition to the “user” endpoints. Because of SDP, all these endpoints are distributed to all other participants (contexts), so SDP scales poorly (especially in ROS 2).
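To make the scaling argument concrete, here is a rough back-of-the-envelope sketch. It only counts endpoint records, not bytes or the heartbeat/acknack meta-traffic, and the function name and the default of 16 builtin endpoints per node are assumptions taken from the low end of the estimate above.

```python
def sdp_endpoint_records(participants, nodes_per_participant,
                         user_endpoints_per_node,
                         builtin_endpoints_per_node=16):
    """Rough count of endpoint records exchanged during EDP under SDP.

    Assumption: every participant learns every endpoint of every other
    participant, which is what full peer-to-peer discovery implies.
    Returns (records received by each participant, records overall).
    """
    endpoints_per_participant = nodes_per_participant * (
        builtin_endpoints_per_node + user_endpoints_per_node
    )
    received_by_each = (participants - 1) * endpoints_per_participant
    total = participants * received_by_each
    return received_by_each, total


# 10 contexts, one node each, 4 user endpoints per node
print(sdp_endpoint_records(10, 1, 4))  # (180, 1800)
```

The total grows with the square of the number of participants, which is why the builtin topics of every node weigh so heavily.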
The new version of the Discovery Server (DS) reduces the traffic up to 93% by the means of a Server that centralizes all the PDP/EDP traffic (meaning Clients only send their discovery packets to the Server). It is the Server’s job to redistribute the information, with the particularity that it only connects Clients that have something to say to each other, i.e. they share a topic and have one publication and one subscription in it respectively.
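The matching rule can be illustrated with a small sketch. This is not the Fast DDS implementation, just a toy model of "connect two Clients only if one publishes and the other subscribes on the same topic"; the function name and the tuple layout are hypothetical.

```python
from collections import defaultdict
from itertools import product


def clients_to_connect(endpoints):
    """Return the pairs of clients a server would need to introduce.

    endpoints: iterable of (client, topic, kind), kind is "pub" or "sub".
    Two clients are paired only if they share a topic with a publication
    on one side and a subscription on the other.
    """
    pubs = defaultdict(set)
    subs = defaultdict(set)
    for client, topic, kind in endpoints:
        (pubs if kind == "pub" else subs)[topic].add(client)

    pairs = set()
    for topic in pubs.keys() & subs.keys():
        for publisher, subscriber in product(pubs[topic], subs[topic]):
            if publisher != subscriber:
                pairs.add(frozenset((publisher, subscriber)))
    return pairs


graph = [
    ("A", "/scan", "pub"),
    ("B", "/scan", "sub"),
    ("C", "/odom", "pub"),  # nobody subscribes, so C is not paired
    ("D", "/cmd", "sub"),   # nobody publishes, so D is not paired
]
print(clients_to_connect(graph))  # {frozenset({'A', 'B'})}
```

Under SDP all four participants in this example would exchange endpoint information with each other; with the rule above only A and B are put in contact.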
SDP suffers from a further problem, which is that it entirely depends on multicast for the PDP phase. This makes SDP burdensome to use in networks where multicast does not work reliably, such as WiFi networks, complex networks with several links between publishers and subscribers, and so on. This is why DS limits the communication to unicast, which entails that Clients need to know where the Server is beforehand.
ROS 2 users switching to DS will simply experience discovery working in scenarios where it did not before (WiFi, complex networks, lower bandwidth networks, etc.). It’s a recurring issue which is only growing with the growth of ROS 2 and the rise of collaborative robotics and such. Keep in mind that discovery protocols do not affect the user’s data delivery, so latency, throughput, loss rate, CPU usage, etc. are unaffected. The vast majority of the discovery traffic occurs at the very beginning, so it does not affect the normal operation of applications. Bottom line is: use SDP as long as it works for you, but as soon as you run into some problem, do not waste time checking hardware configuration or avoiding WiFi, etc. Just set an environment variable to configure your clients, run a server instance in a separate terminal, and you’re good to go!
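As a hedged sketch of the client side: the post above does not name the environment variable, so this assumes it is `ROS_DISCOVERY_SERVER` with `host:port` entries separated by semicolons, as described in the Fast DDS documentation. The helper name and the example address and port are illustrative only.

```python
import os


def discovery_server_env(servers, base_env=None):
    """Return a copy of the environment with the client setting added.

    servers: list of (host, port) tuples, in the order the servers
    should be listed.
    Assumption: the variable is ROS_DISCOVERY_SERVER and entries are
    joined with ';' (check the Fast DDS docs for your version).
    """
    env = dict(os.environ if base_env is None else base_env)
    env["ROS_DISCOVERY_SERVER"] = ";".join(
        f"{host}:{port}" for host, port in servers
    )
    return env


env = discovery_server_env([("127.0.0.1", 11811)])
print(env["ROS_DISCOVERY_SERVER"])  # 127.0.0.1:11811
```

The resulting dictionary could then be passed as `env=` to `subprocess.run` when launching a node, with the server itself started separately from the Fast DDS command-line tool.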
DS is not the default Discovery Protocol in Fast DDS as the DDS standard does not allow that. Making it the default in ROS 2 would mean that end-users would always need to instantiate the Server (BTW, any participant can be configured as Server, we just made a CLI tool to ease that). I think that, being as easily configurable as it is, making the switch is a matter of seconds, which means that applications can very easily make it “their default”. Note that DS is a Fast DDS feature that is not available in any other middleware implementation. In any case, we have issued a PR to ros-documentation to make a full tutorial on how to use DS in different types of deployments so that everyone can make a sound decision. We host the same tutorial on the Fast DDS ReadTheDocs page.
I’m by no means an expert in ROS 1 master, so I don’t know all the inner implementation details about it. However, I can pinpoint some key differences:
ROS master is a stand-alone application. DS, on the other hand, is a DDS participant that, instead of using SDP for discovery, is configured as Discovery Server or Discovery Client. This means that anyone can fulfill any role.
ROS master is by definition a single point of failure: if master is out, discovery is out. With DS, you can run any number of redundant servers, so in case one goes down, discovery still works. Furthermore, you can connect several “sub-networks” by simply connecting their servers.
In addition, DS features a BACKUP mode, where the server stores the discovery state in a permanent storage so, in case of failure, it can recover the previous state without having to “re-discover” the entire graph.
ROS master communicates over TCP. DS uses UDP by default, but it can be configured to use UDP or TCP over IPv4 or IPv6.
Now that’s interesting! This seems like the ideal solution - each robot would run its own server, and yet they could reach topics from all other robots or a base station reliably.
Any way to dynamically change this list for already running nodes? Or does it have to be fully specified in advance?
And if the server works based on discovery, does it also mean that after a server is restarted (e.g. after a crash), it can automatically find the topology again and all nodes that used it will continue to run and be able to create new subscriptions and publications? That was probably the biggest pain point in ROS 1 - once ros master crashes, you have to restart everything…
That’s precisely one of the use cases we had in mind!
Unfortunately, the current implementation only supports specifying a list of servers on start-up. Note that it’s possible to specify addresses where a server might be in the future. The drawback of this is that Clients hail the servers in their list periodically, at a rate of around 2 Hz, until the servers are brought online. However, this hailing frequency can also be configured using XML files.
Most definitely. As Clients are independent applications from the Server, the server going down only affects discovery while it’s down, but publications and subscriptions already created will keep working normally. As you say, once the server is up and running again, it will re-discover the entire graph, and new publications and subscriptions will be discovered and distributed as if nothing had happened. Furthermore, if the server is configured as backup, it will load the discovery graph from permanent storage at start-up time.
Yes it can. Although neither Servers nor Clients have any specific configuration for this, their participants can be configured in the same way as before to limit the amount of allocated space dedicated to discovered entities. This of course requires prior knowledge of the network topology, but it is achievable in the same manner as for SDP using XML files.
So can you explain what the metrics are for ROS 2 end users? Does it now come up faster? In my experience, the volume of discovery traffic does impact the system. At points in time, it was so high that actual messages were being drowned out due to the discovery traffic. I’m asking this because the traffic reduction is so high, I was hoping this announcement represented some user-noticeable improvements in some metric.
I don’t think the defaults for the DDS standard should impact the defaults we select in the ROS community. The standard Fast-DDS user can have those defaults but ROS 2 can ship with its own.
I agree that for some applications, the ROS Master being a single point of failure is a real selling point. From my experience and talking with the mobile-robotics community, that point was never really that important to many of us about ROS 2. Having the option to operate without the single point of failure I agree is a great attribute of ROS 2, but I don’t think that it means that switch has to be flipped on default for a first-time user experience.
ITT: in an above reply commenting about roscore in ROS1:
The single point of failure was a massive pain in multihost setups and very disruptive whenever it would go down, either from spotty network connectivity issues between hosts, accidentally halting the wrong launch file, rebooting the wrong workstation or robot, when DHCP servers out of your control allocated your hardware new IP addresses, etc.
The added redundancy here with this new DS supporting multiple servers sounds like a fair compromise, but it would still be nice if these new discovery servers could be “discoverable” for participants or DS instances. Sort of like decentralized discovery across servers but not entirely distributed across all participants.
E.g.: m*(m-1)/2 + n connections instead of n*(n-1)/2 where m << n with m servers and n participants. Each robot could host its own DS, and discovery info external to the robot would be exchanged via the discovery servers per agent in the robot swarm. Although I think this is actually closer to what the “DDS Routing Service” already does.
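To put numbers on the two link-count formulas above, here is a minimal sketch. The function names are made up for illustration, and the model assumes each participant attaches to exactly one server while the servers form a full mesh among themselves.

```python
def full_mesh_links(n):
    """Links when all n participants discover each other: n*(n-1)/2."""
    return n * (n - 1) // 2


def server_mesh_links(m, n):
    """Links with m fully meshed servers and n participants,
    each participant attached to one server: m*(m-1)/2 + n."""
    return m * (m - 1) // 2 + n


# 5 robots with 200 participants each, one server per robot
n, m = 1000, 5
print(full_mesh_links(n))       # 499500
print(server_mesh_links(m, n))  # 1010
```

With m much smaller than n, the server term is negligible and the count grows roughly linearly with the number of participants instead of quadratically.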
First of all, I’d like to thank you all for the engagement, I think there are a lot of very good points to extend the discussion!
I’ll try to clarify my previous answers. Both SDP and DS can be divided into two “stages”: one where the discovery actually takes place (all the PDP and EDP traffic I mentioned), and another one where the discovery is in a steady state, meaning everything is discovered and there are no late joiners.
Even though the 93% traffic reduction refers to the PDP/EDP stage, there is also a massive reduction in the steady state (I do not have a concrete number as of now). This is because in the steady state, SDP participants send periodic announcements with their PDP information to the multicast address corresponding to their domain (to be discoverable by newcomers), and also to each of the known participants individually via unicast. The reason they hail already known participants is to assert participant liveliness in a way that is more reliable than multicasting. They do this with a specific frequency set by the announcement period (see also lease duration).
In DS, the steady state liveliness assertion announcements are only exchanged between server and client using unicast, so as the number of participants grows, the difference in traffic on this stage gets larger and larger.
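A simplified model of the steady-state difference, counting announcement datagrams per announcement period: it assumes a single server, one multicast plus one unicast per known peer for SDP, and one announcement in each direction between a client and the server for DS. These are modelling assumptions for illustration, not measured values.

```python
def sdp_announcements_per_period(n):
    """Each of n participants sends 1 multicast announcement plus
    one unicast announcement to each of the other n - 1 participants."""
    return n * (1 + (n - 1))


def ds_announcements_per_period(clients):
    """Each client and the single server exchange one announcement
    in each direction (assumed model)."""
    return 2 * clients


for n in (10, 100):
    print(n, sdp_announcements_per_period(n), ds_announcements_per_period(n))
# 10 100 20
# 100 10000 200
```

The gap widens from 5x at 10 participants to 50x at 100, which matches the point that the difference keeps growing with the number of participants.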
In this sense, what you’re saying makes total sense when deploying a large number of participants in a constrained network, even more so if you have constant late joiners throughout your operation. In such cases, using DS will have a significant effect on latency, throughput, and drop rate, since the steady state discovery traffic has also been significantly reduced. We have performed such large distributed tests focusing on discovery times, but not yet on those other performance aspects. In fact, it’d be very useful for us to gather test-cases from real users that really stress the system, so we can optimize for your scenarios.
As for discovery times, our preliminary results show equivalent discovery times. However, the switches in our testing environment had to be configured to broadcast all multicast traffic, as they don’t seem to be able to handle multicast correctly. As you can imagine, this is not what real users will experience, since in their case SDP will simply not work, and modifying the switches’ config is definitely not an out-of-the-box experience. Furthermore, we simply could not make SDP work over WiFi as we moved to a larger number of participants. In any case, we’d like to write a proper white paper about our process so everyone can replicate our results, and for that it’d be interesting to receive some real cases from you guys!
I definitely agree with this, although we’d need to wait for Galactic for such a change. It’d probably entail choosing a default port for the server on the loopback interface, automatically instantiating one server when the first context in the domain is brought up, and properly configuring the clients to connect to that server. I think this is a conversation worth having in the Middleware WG.
This is indeed an interesting proposal. I’ll take note of that for the next iteration of the Discovery Server. As for the DDS routers, they could solve the multicast problem with SDP; however, they do not tackle the massive traffic that SDP entails for a large number of participants. Our intention with the DS was to provide a solution that can be used by every ROS 2 user in a matter of seconds, and that can solve the SDP discovery problems. Although DDS routers could make sense in an industrial certified environment in production, I don’t think they are a viable solution for the biggest part of the ROS 2 community.
One philosophical question. It may sound offensive but believe me it’s not meant like that, it’s my pure curiosity
I thought the switch to 3rd-party DDS libraries for ROS 2 was done because these libraries do their “one thing” very well, and are tested by (tens of?) years of real-world usage. Now Fast-DDS comes with a solution to “overwhelming” traffic caused by SDP. How is it that it comes only now? Did no application before ROS 2 try any DDS with a larger number of participants? I thought I saw mentions of networks of thousands of nodes somewhere running on top of (some) DDS…
I just can’t wrap my head around it. Is ROS 2 so far from the classical domains DDSs were used in?
I don’t think there are many examples of deployments of DDS with thousands of nodes. As a DDS provider, I know scenarios with hundreds of nodes, and in those cases, they need help with the discovery phase.
The alternatives to the DDS default discovery mechanism have been there for a while. We already supported what we call static discovery, in which you suppress entirely the second phase of discovery when you know in advance the types of nodes you are going to have in your system.
The server-based discovery has been available in commercial DDS implementations for years. The ROS 2 Discovery Server is just the first open-source implementation of this concept.
But it also makes me a little bit more worried about migrating to ROS 2, as our robots can run between 50-200 nodes (yeah, that could be optimized, but it’s academic research…). And we want to run about 5-10 of them interconnected. So it’s good to know we should stick with the discovery server from the beginning.
Pretty much. ROS 2 uses DDS in ways it wasn’t really designed for and has not really been used for in production applications. We started moving into large robot systems distributed over complex, managed networks, and ROS itself creates a lot of internal topics for every node. This turned up problems in the way DDS works by default that existing applications either didn’t turn up, or did turn up, but because those applications were closed, the problems and their solutions were not publicised. I think it’s notable that RTI Connext DDS has something similar to a discovery service already.
On loopback you’d probably be OK. If you want to use it over the network, especially wifi, then use the discovery server and you should be fine.