Proposed changes to how ROS performs discovery of nodes

As part of the upcoming Iron Irwini release, Open Robotics is planning to address a couple of common requests/complaints regarding DDS behaviour in ROS 2:

  • nodes can too easily discover other nodes that are not part of their application, leading to, for example, robots suddenly moving; and
  • discovery traffic by default goes everywhere, flooding networks, which can bring down networks or at least degrade network performance significantly.

ROS 2 currently includes a ROS_LOCALHOST_ONLY environment variable.
Setting this environment variable to a true value causes ROS 2 nodes to only perform discovery on the local machine.
In effect, it isolates an application from the network.
This is useful for preventing the accidental discovery and control of robots, and for preventing a discovery storm on the local network segment.
However, it does not satisfy all use cases: if you wish to control a robot, discovery traffic must go over the entire network segment, and you must coordinate DDS domain IDs to prevent accidental control of other robots.

To make control of discovery both easier and more fine-grained, in Iron Irwini we are considering doing the following.

  • Deprecating the ROS_LOCALHOST_ONLY environment variable.
  • Adding a new environment variable, ROS_AUTOMATIC_DISCOVERY_RANGE, which will control how far discovery traffic for the automatic discovery process can travel.
  • Adding a new environment variable, ROS_STATIC_PEERS, which will allow users to specify the address(es) of other hosts (robots, etc.) to connect to.

The ROS_AUTOMATIC_DISCOVERY_RANGE variable will accept the following values.

  • 1: Disable automatic discovery. No other nodes will be found automatically, even on the same host.
  • 2: Limit automatic discovery to the local host only. Nodes on the same host will be automatically found, unless they disable their own automatic discovery, but no discovery traffic will leave the host.
  • 3: Enable automatic discovery on the local subnet. Nodes on other hosts that are on the same subnet will be found automatically, unless they disable their own automatic discovery.

The ROS_STATIC_PEERS variable will accept a semicolon-separated list of addresses (IP addresses and hostnames searchable via DNS).
All nodes on the hosts specified in this variable will be discovered, even if they have turned off automatic discovery.
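
As a rough illustration of the list format described above, the following sketch splits a ROS_STATIC_PEERS-style string and classifies each entry. The function name `parse_static_peers` and the example address are hypothetical; this is not how the prototype itself parses the variable.

```python
import ipaddress


def parse_static_peers(value):
    """Split a semicolon-separated peer list into (kind, address) pairs."""
    peers = []
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate empty entries such as a trailing ';'
        try:
            ipaddress.ip_address(entry)
            kind = "ip"
        except ValueError:
            # Not an IP literal: assume it is a hostname and leave
            # resolution to the OS resolver (not validated here).
            kind = "hostname"
        peers.append((kind, entry))
    return peers


# Hypothetical example values
print(parse_static_peers("192.168.0.10;myrobot.local"))
```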

The combination of these two variables will enable, for example, automatically discovering only nodes on the local host, while also talking to a robot at a known address, but nothing else on the network.
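
A minimal sketch of that combination, assuming the variable names and numeric values proposed above: it builds an environment that limits automatic discovery to the local host and names one robot as a static peer, then starts a child process with it. The helper `discovery_env`, the address 192.168.0.10, and the stand-in child command are all hypothetical.

```python
import os
import subprocess
import sys


def discovery_env(discovery_range, static_peers=(), base=None):
    """Return a copy of the environment with the two proposed variables set."""
    env = dict(os.environ if base is None else base)
    env["ROS_AUTOMATIC_DISCOVERY_RANGE"] = str(discovery_range)
    env["ROS_STATIC_PEERS"] = ";".join(static_peers)
    return env


# 2 = local host only (per the proposal); the peer address is made up
env = discovery_env(2, ["192.168.0.10"])

# Stand-in child process that just echoes the variable; replace it with
# the command that starts your node.
child = [sys.executable, "-c", "import os; print(os.environ['ROS_STATIC_PEERS'])"]
subprocess.run(child, env=env, check=True)
```

Setting the variables per process like this, rather than in a shell profile, keeps the configuration next to the code that launches the node.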

The interaction of these two variables can be a little difficult to grasp.
We hope the tables below will help.
A :x: indicates that nodes A and B will not discover each other and communicate.
A :white_check_mark: indicates that nodes A and B will discover each other and communicate.

Same host

| Node A setting | Node B: no static peer, Off | Node B: no static peer, Localhost | Node B: no static peer, Subnet | Node B: with static peer, Off | Node B: with static peer, Localhost | Node B: with static peer, Subnet |
|---|---|---|---|---|---|---|
| No static peer, Off | :x: | :x: | :x: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| No static peer, Localhost | :x: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| No static peer, Subnet | :x: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| With static peer, Off | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| With static peer, Localhost | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| With static peer, Subnet | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |

Different hosts

| Node A setting | Node B: no static peer, Off | Node B: no static peer, Localhost | Node B: no static peer, Subnet | Node B: with static peer, Off | Node B: with static peer, Localhost | Node B: with static peer, Subnet |
|---|---|---|---|---|---|---|
| No static peer, Off | :x: | :x: | :x: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| No static peer, Localhost | :x: | :x: | :x: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| No static peer, Subnet | :x: | :x: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| With static peer, Off | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| With static peer, Localhost | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| With static peer, Subnet | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |

Note that only one of the two hosts needs to set a static peer address for communication to succeed.
We chose this approach to allow a host to communicate with a robot without needing to change the configuration of the robot.
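
The rules encoded in the two tables can be condensed into a few lines. The sketch below is only a model of the intended behaviour, not code from the prototype; the names `Range` and `will_discover` are hypothetical, and it assumes that "with static peer" means the node lists the other node's host in ROS_STATIC_PEERS.

```python
from enum import IntEnum


class Range(IntEnum):
    """Proposed ROS_AUTOMATIC_DISCOVERY_RANGE values."""
    OFF = 1
    LOCALHOST = 2
    SUBNET = 3


def will_discover(range_a, range_b, same_host, a_lists_b=False, b_lists_a=False):
    """Return True if nodes A and B are expected to discover each other."""
    # A static peer entry on either side is enough, whatever the ranges are.
    if a_lists_b or b_lists_a:
        return True
    if same_host:
        # Automatic discovery on one host works unless either side is Off.
        return range_a != Range.OFF and range_b != Range.OFF
    # Across hosts, both sides must allow subnet-wide automatic discovery.
    return range_a == Range.SUBNET and range_b == Range.SUBNET


# Localhost-only nodes on different hosts, with A naming B's host as a peer
print(will_discover(Range.LOCALHOST, Range.LOCALHOST, same_host=False, a_lists_b=True))  # True
```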

A value for ROS_AUTOMATIC_DISCOVERY_RANGE of 3 and an empty ROS_STATIC_PEERS is equivalent to the current default in Humble Hawksbill.
A value for ROS_AUTOMATIC_DISCOVERY_RANGE of 2 and an empty ROS_STATIC_PEERS is equivalent to setting ROS_LOCALHOST_ONLY in Humble Hawksbill.
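
The two equivalences above can be written as a small mapping, which may help when migrating existing setups. This is a sketch only: the function name is hypothetical, and it assumes that "1" is the true value used for ROS_LOCALHOST_ONLY.

```python
def range_from_legacy(env):
    """Map a Humble-style environment onto the proposed range value."""
    # Assumption: ROS_LOCALHOST_ONLY=1 is the "true" setting.
    if env.get("ROS_LOCALHOST_ONLY", "") == "1":
        return 2  # local host only
    return 3  # subnet-wide discovery, the Humble default


print(range_from_legacy({"ROS_LOCALHOST_ONLY": "1"}))  # 2
print(range_from_legacy({}))  # 3
```

In both cases ROS_STATIC_PEERS would be left empty to match the Humble behaviour.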

As part of the above changes, we are also considering making 2 the default value for ROS_AUTOMATIC_DISCOVERY_RANGE, thereby preventing network connections by default.

We believe this should satisfy many common use cases for ROS 2.
However we are aware that there are complex corner cases that may not be satisfied.
What we want to hear from ROS 2 users is:

  • does this satisfy your use case(s)?
  • are there any situations where you would have problems with these variables?
  • are there alternative settings that should be exposed that would better address the same problems?
  • what would you like the default values for these to be?

If you want to try these variables out, you can do so already.
You will need to compile ROS 2 Rolling Ridley from source, using the following branches in these repositories.

We would like to emphasise that the above sources are a prototype.
The behaviour does not perfectly match the ideal specified in the above tables, and even that ideal may change based on feedback.
We are still working with DDS vendors to identify the correct settings for their DDS implementations.
However it is working well enough, particularly between two hosts, to try out the behaviour and understand if it will meet your needs.
The most complete implementation is for rmw_fastrtps.

Please try it out, and let us know how happy or sad you are about this proposed change.

+1 for this, as long as this is documented pretty early in the entry-level tutorials for new users trying to use, say, rviz to visualize data from their robot on a developer machine. Since this is a standard task that would not work out of the box, it would be good to head off ROS Answers questions and “ROS 2 doesn’t work” complaints.

Seems like these 2 variables + Domain ID would need to become Intermediate tutorials so they’re on the list of things to know about once you start getting serious about using ROS 2. Perhaps not going into excruciating detail (Concepts for that), but enough to explain:

  • How to visualize data from your robot on a developer machine (set to 3 or add a static peer to your .bashrc / commandline for your robot)
  • Approachable reasons to use 1, 2, and 3
  • The alternative of isolating different systems’ networks via the DDS Domain

I wonder if this couldn’t be done a bit more simply with just a ROS_PEERS variable. If it’s not set, then we use local host by default. If it’s set to any number of IPs, use local host + IPs (or just the IPs, leaving it to the user to include localhost in the list?). If set to an empty string, automatic discovery is disabled.

I think that would represent the same things but without duplicate variables to have to track in parallel.

Something about RANGE in ROS_AUTOMATIC_DISCOVERY_RANGE doesn’t quite mean what I would think it would given the name. But that would be an argument to just rename that environment variable. ROS_AUTOMATIC_DISCOVERY_TYPE or _METHOD or _SETTING?

Thanks! This sounds interesting, although complex… Too complex to be easy for beginners, and not complex enough to cover all cases…

A few ideas I have right away:

  1. Support wildcards and netmasks in the ROS_STATIC_PEERS variable.
    • Maybe support basic boolean or set operators? (mainly negation and exclusion)
  2. Support specifying a list of network interfaces on which full discovery would be allowed.

Agree, I was also confused. Based just on the name, I thought this setting will write through to TTL of the discovery packets…

Speaking of TTL, wouldn’t that actually resolve some of the problems, too? Like setting it to 2 by default, so that the usual chain PC-router-robot works, but it could not traverse further?

Also, the numeric values for ROS_AUTOMATIC_DISCOVERY_RANGE are not very well self-documenting… What about string values OFF, LOCALHOST, SUBNET or similar? Or why do the numbers not start from zero?

I’m going to assume here the OS takes care of the DNS lookup, so when mdns is enabled, myrobot.local works.


Are these changes back-portable to Humble?

I find the use of three different environment variables (ROS_AUTOMATIC_DISCOVERY_RANGE, ROS_STATIC_PEERS, ROS_DOMAIN_ID) quite complicated and I also do not see why someone would set ROS_AUTOMATIC_DISCOVERY_RANGE to 1 in order to have no communication (even on the localhost) at all.
I understand that ROS_AUTOMATIC_DISCOVERY_RANGE=1 and ROS_STATIC_PEERS=<ip1>;<ip2> simulates multiple ROS 1 masters on localhost, but I don’t understand why a node A “with static peer” can communicate with a node B with “no static peer”. Using the ROS_DOMAIN_ID instead of the ROS_STATIC_PEERS appears more logical to me, as both nodes would have to decide to be on the same “domain”.

LCM (Lightweight Communications and Marshalling) uses the time-to-live (TTL) as a way to control how far messages are propagated. This is 0 by default, meaning that packets stay on the local host; a value of 1 means that a message is propagated by one hop (i.e. within a subnet), and larger values mean that messages can propagate further through routers. This setting and the meaning of the variable make much more sense to me. Simulating this with a ROS_DISCOVERY_TTL set to 0 by default for localhost-only communication would be much easier to use in my view. Together with the ROS_DOMAIN_ID, this should cover all the local and remote use cases.
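
For readers unfamiliar with the mechanism, this is roughly what a TTL setting controls at the socket level. The sketch uses the standard `IP_MULTICAST_TTL` socket option; ROS_DISCOVERY_TTL is only the name suggested in this post and is not read by any ROS 2 code, and `make_multicast_sender` is a hypothetical helper.

```python
import os
import socket
import struct


def make_multicast_sender(ttl):
    """Create a UDP socket whose multicast packets carry the given TTL."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # TTL 0 keeps multicast on the host, 1 on the local subnet,
    # larger values let routers forward it further.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, struct.pack("B", ttl))
    return sock


# Hypothetical variable from this post; defaults to host-only
ttl = int(os.environ.get("ROS_DISCOVERY_TTL", "0"))
sock = make_multicast_sender(ttl)
print(sock.getsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL))
sock.close()
```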

An alternative way to restrict communication between a specific set of hosts would be point-to-point tunnels that forward DDS traffic only to a specific host. For example, hosts A and B could both set ROS_DISCOVERY_TTL=0 but use a “tunnel” each, connecting to each other’s IPs to route traffic between these two hosts. That way, you make sure that traffic does not leave your host accidentally and you explicitly define which hosts should communicate with each other.

Do you know if/when this change will be available in Rolling? Thanks :slight_smile:

I think it’d be useful for tests.

Please, no more environment variables… Why not introduce a standard configuration file that can be modified through the CLI?
This is also closer to what the DDS vendors are doing. Maybe instead of passing things through all layers, we can have rmw read it. Also easier to modify later.

I’ll point out a few things here:

  1. All of what is being proposed here is already possible through the DDS vendor configuration files. However, it tends to be complicated to set up, so I worry that a configuration file would eventually have similar problems.
  2. Creating some kind of configuration file for the RMW layer is a project all on its own. Should we use XML, JSON, INI, something else? What should be configured? How do we map it to the backends? None of it is impossible, but in my experience adding a config file tends to be more complicated than you expect.
  3. Doing this as environment variables now doesn’t preclude adding in a configuration file in the future. But I don’t think we should scope creep the effort to improve out-of-box behavior with adding a config file.
  4. As it stands, the way that this is implemented is as an init option to rcl. If the init option isn’t set, then we fall back to reading environment variables. This means that this can be specified per-node (via the init options), or per-process (via the environment variable). I expect that most users would end up using the environment variables, and only sophisticated users/use-cases would use the init option.
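
The fallback order described in point 4 can be sketched as follows. This only models the precedence (explicit init option first, then environment variable, then a default); it is not the rcl API, the function name is hypothetical, and the default of "2" is merely the value under consideration in the proposal.

```python
import os


def resolve_discovery_range(init_option=None, env=None, default="2"):
    """Pick the discovery range: init option, else environment, else default."""
    env = os.environ if env is None else env
    if init_option is not None:
        # Per-node setting wins over the per-process environment.
        return str(init_option)
    return env.get("ROS_AUTOMATIC_DISCOVERY_RANGE", default)


print(resolve_discovery_range(init_option=3, env={"ROS_AUTOMATIC_DISCOVERY_RANGE": "1"}))  # 3
print(resolve_discovery_range(env={"ROS_AUTOMATIC_DISCOVERY_RANGE": "1"}))  # 1
print(resolve_discovery_range(env={}))  # 2
```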

With all of that said, can you please explain your objection to environment variables? Even if we end up doing this as environment variables for now, it would be interesting to know for the future.

I think one drawback with the current system of domain IDs, which persists in this system, is that there is no way to discover a node without restarting some process. Likewise I can’t stop listening without restarting. It would be nice to be able to switch on camera streams from one robot without restarting the camera node with new environment variables.

One additional note: the documentation should reflect what additional config steps are required when changing these variables. For instance, my standard system (Ubuntu) is not working when ROS_LOCALHOST_ONLY=1 if multicast is not enabled on the loopback interface (it is not by default) with Cyclone.
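
One way to check for the loopback condition mentioned above on Linux is to read the interface flags from sysfs and test the multicast bit. This is a Linux-only sketch with hypothetical helper names; `IFF_MULTICAST` is the flag value from `<linux/if.h>`.

```python
import os

IFF_MULTICAST = 0x1000  # interface flag bit from <linux/if.h>


def flags_have_multicast(flags):
    """Return True if the interface flag word has the multicast bit set."""
    return bool(flags & IFF_MULTICAST)


def interface_has_multicast(name="lo"):
    """Read an interface's flags from sysfs (Linux only) and test the bit."""
    with open(f"/sys/class/net/{name}/flags") as f:
        return flags_have_multicast(int(f.read().strip(), 16))


if os.path.exists("/sys/class/net/lo/flags"):
    print("multicast on lo:", interface_has_multicast("lo"))
```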

From a deployment perspective, having settings in a configuration file provides a cloud-native approach that scales consistently: the file can be stored in source control, packaged in a container, tested in cloud infrastructure and deployed to the robot. The deployed configuration is also easy to inspect visually and to test.

Environment variables are simple for an individual developer, but they add complexity when scaling a deployment. A script is required to set the variables on each system, which is equivalent to reading them from a config file.

Thanks

In what way?

We are explicitly not trying to cover all use cases. For more complex use cases, the DDS XML configuration files are available.

We are thinking about this, but what network interface(s) to use is an orthogonal issue.

Unfortunately this is a significant change in behaviour, so it won’t be backported.

The downside of this is that you have to change the configuration on both your computer and the robot. This doesn’t work in situations where you want multiple people to access a robot in turn but not conflict (except intentionally), such as in a classroom environment.

Not for a while yet. We need to iterate on the design and the UI, and work with the DDS vendors to make it work smoothly before we can merge the branches.

I think introducing a configuration file at the rcl level is doable but is a larger piece of work than, and separate from, this change. Introducing one at the rmw level doesn’t make much sense to me because there are so many different RMW implementations.

I would generally just prefer that there is a documented and tested method for setting these values programmatically. If Open Robotics decides to stick with environment variables to configure this by default (and that doesn’t work for me) I at least have a supported way to do whatever I want, like creating my own configuration file.

We’re talking about two configuration variables here, which apply to the entire system. The complexity of the DDS configuration comes from having tens of configuration variables, and the general approach taken by vendors to make them configurable on a very fine-grained level (such as individual readers/writers). Therefore I don’t think we are at risk of getting anywhere near that complexity anytime soon.

I am aware of the complexities and think those could be addressed, but I think you’re also referring to the fact that it might be more complex than warranted for this particular use case. This is a bit of a chicken and egg problem, right: if we already had a configuration file infrastructure, it probably would be easy to add. But we don’t have one, and if we always use environment variables, there’s no point in creating one.

  1. One objection is related to the use of environment variables for configuration in general: They can be set in a variety of ways and the interaction between these ways frequently causes issues. For example, environment variables set in .bashrc are often not available in graphical tools started from the windowing environment’s menu. Also, SSH forwards a few environment variables by default, but skips most, and this can be configured both on the client and the server side. Another common tool, Docker, has the same issue, where environment variables are only passed on when specified either through a command-line argument or in the global config.
  2. A more specific objection in this particular case is that we have two closely interacting variables, which are, however, not necessarily set together (in contrast to a configuration file, where they are set in one place). I would consider this even more confusing.

To successfully communicate with a robot, the user will now have to get at least 2 variables right out of 3 - ROS_DOMAIN_ID (the defaults are okay if the user does not change them), and ROS_AUTOMATIC_DISCOVERY_RANGE (this one would need to be set to its non-default) or ROS_STATIC_PEERS.

Beginner users are often in the unfortunate situation that they do not know what they are doing, and if there are multiple ways to achieve the same result, they get confused and choose a random way, which may come back to bite them later.

ROS_AUTOMATIC_DISCOVERY_RANGE seems to me to be an advanced variable which should be “hidden” from the users in beginner tutorials. Just teach the users to use ROS_STATIC_PEERS to connect to a robot. That would be fine and it would almost match the UX people had with ROS_MASTER_URI on ROS 1 (which I’d say was pretty easily digestible even for the novice students we had). The only difference would be that if you connect more computers to one robot, they would not be able to see each other’s topics if they publish some.

Okay, that’s a fair point.

Orthogonal to what? I still see it as pretty important to be able to configure which interfaces take part in the discovery process. I think this could be a complete substitute for ROS_AUTOMATIC_DISCOVERY_RANGE. By default, only loopback would be allowed. Then users could choose whether they want all interfaces to take part, or just some. A special empty value could then disable discovery completely.

Thanks for posting this!

Along with the above comments, I am a bit concerned about the complexity; however, I think it’s fine to have complexity at this prototyping stage, as long as we can openly allow discarding the prototype and doing a redesign (if that makes sense).

For logistics and testing, I always find it very hard to work through design and code spread across multiple repositories that aren’t linked. I have two requests:

(1) Instead of linking to separate repositories, can you centralize your dev work in a new repository with something like vcs.repo declaration of your branches, along with vcs-exact.repo that has exact commits as you are developing? (see small example here for what I’ve used when prototyping libsdformat changes in coordination with Drake).

(2) Can you make all of these discovery options easily testable on one machine using containerization / namespaces?
If not, can you provide mechanisms for quickly provisioning and testing against multiple machines to confirm behavior as expected? (See rough example here for testing libfranka timing issues on their UDP packets)

How do you envision this working for multiprocess testing?
In drake-ros, rmw_isolation functions in such a way that admits (somewhat “random” / minimal conflict) multiprocess communication without exact specification.

Which aspect of this new specification would admit that?
If there isn’t an existing mechanism, it would be excellent to include it.

Agreed 100%. We should aim towards minimizing indirection by making things as self-evident as possible, in as local a fashion as possible.

For those interested in trying the changes out, I have created a repos file for use with vcs and stored it in this repository.

However, note that you should compile this with a full ROS 2 source compile (due to workspace overlaying issues), so you should prefer the repos file that provides that.

While working to get this implemented, I’ve found myself wondering: What’s the use case for OFF? If a machine (named OFF_HOST) uses the OFF setting, none of the nodes on the same machine will discover each other but every node on a static peer (named PEER_HOST) will find every node in OFF_HOST. I feel like this creates a bizarre asymmetry in the node topology, and I just can’t think of a use case that would motivate it. Especially since discovery within a single host shouldn’t affect network traffic anyway.

I could understand a motive for the OFF setting if static peers were granular enough to specify certain nodes on certain peers (e.g. allow specific nodes on your localhost to discover each other while isolating the rest) but that’s not the case. So instead we’d just always have asymmetrical discovery with this setting.

Another consideration (unrelated to the OFF setting that I just commented on) is whether there should be a setting that blocks static peers.

According to the current design, ROS_AUTOMATIC_DISCOVERY_RANGE=LOCALHOST will restrict multicast discovery to just be within one host, but would still allow an external host to declare it as a static peer and have discovery succeed.

It seems likely that users would want an option that both restricts the automatic discovery range to localhost and prevents any external hosts from punching into the system by declaring it as a static peer. So to that end I would ask these questions:

  1. Is such an option desirable?
  2. If so, should it be a new value of ROS_AUTOMATIC_DISCOVERY_RANGE, e.g. ISOLATED, or should it be a different environment variable altogether, e.g. ROS_ALLOW_STATIC_PEERS=OFF?
  3. Will the DDS implementations even be able to support disabling of static peers? Does the DDS spec allow static discovery to be blocked?

Edit to add: A possible use case of the combo ROS_AUTOMATIC_DISCOVERY_RANGE=OFF + ROS_ALLOW_STATIC_PEERS=OFF would be to run many test nodes simultaneously (where each test node has no need for inter-process communication of any kind) without needing to dispatch unique ROS_DOMAIN_ID values to each process that contains a node. This might be a useful feature for CI.