Proposed changes to how ROS performs discovery of nodes

I’m on board with either using ROS_AUTOMATIC_DISCOVERY_RANGE=ISOLATED or ROS_ALLOW_STATIC_PEERS=OFF.

One downside of ROS_ALLOW_STATIC_PEERS=OFF is that it may lead to exploding configuration complexity. However, I don’t think this is too much of a problem. The example use case given by @grey is definitely nice.

A crazier alternative would be to have a single ROS_DISCOVERY_INTERFACES variable. To mimic ROS_AUTOMATIC_DISCOVERY_RANGE=LOCALHOST, one would simply add the loopback address to it (i.e. ROS_DISCOVERY_INTERFACES=127.0.0.1;). No discovery could be ROS_DISCOVERY_INTERFACES_OFF=TRUE, and discovery on specific interfaces could be ROS_DISCOVERY_INTERFACES=<list of interfaces>. Alternatively, this could list the trusted interfaces for ROS_STATIC_PEERS to listen on.

Throwing in my two cents:

Users could always achieve this with SROS. In fact, they could leverage SROS in combination with this feature to discover some but not all of the nodes running within the same ROS 2 Context.

This could definitely be achieved by the means of the meta information you’re already adding to the user data in combination with DomainParticipant::ignore_participant. Don’t think about it as static discovery, since it really isn’t; participants still exchange discovery information, whereas static discovery would mean that everyone knows all that it needs to know from the beginning. Think about it more in terms of hinting your participant about where other participants are.

This could be enabled if the way of specifying the peers were to contain not only the address but also the port. For instance, when using Fast DDS initial peers, if the port is not specified for a given peer, Fast DDS expands that peer into five of them, each with the port that one of the first five participants would use (as specified in the DDS spec). To ease the use, we could show some info regarding ports in ros2 node info.
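To make the port expansion concrete, here is a minimal sketch of how the candidate ports could be computed. It assumes the default well-known port parameters from the DDSI-RTPS specification (PB=7400, DG=250, PG=2, d1=10); the function names and the example address are made up for illustration and are not part of Fast DDS.

```python
# Default well-known port parameters from the DDSI-RTPS specification.
PORT_BASE = 7400          # PB
DOMAIN_ID_GAIN = 250      # DG
PARTICIPANT_ID_GAIN = 2   # PG
UNICAST_META_OFFSET = 10  # d1


def metatraffic_unicast_port(domain_id: int, participant_id: int) -> int:
    """Default discovery (metatraffic) unicast port for one participant."""
    return (PORT_BASE
            + DOMAIN_ID_GAIN * domain_id
            + UNICAST_META_OFFSET
            + PARTICIPANT_ID_GAIN * participant_id)


def expand_peer(address: str, domain_id: int, count: int = 5) -> list[tuple[str, int]]:
    """Expand a port-less peer into (address, port) pairs.

    One pair is produced for each of the first `count` participant IDs.
    """
    return [(address, metatraffic_unicast_port(domain_id, pid))
            for pid in range(count)]


if __name__ == "__main__":
    # Hypothetical peer address, domain 0.
    for peer in expand_peer("192.168.1.20", domain_id=0):
        print(peer)
```

If a peer could carry an explicit port, only that single (address, port) pair would need to be contacted instead of all five candidates.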

I feel like this creates a bizarre asymmetry in the node topology, and I just can’t think of a use case that would motivate it.

+1000 (per our VC)

As a user of ROS 2 in a research setting, I have a wish that minimal configuration should maximize expected behavior – and possibly “maximize safety”. Perhaps this is achievable if it’s easy for me to reason about possible failure modes arising from “unexpected / undesired traffic”.

Having bizarre asymmetries could make this wish harder to achieve.
Having host A say “don’t talk to me” and host B say “hey, poke into that host A’s traffic”, and allowing that, seems suboptimal from this perspective.

If the counter-argument is that the setting is qualified as “automatic discovery”, then I would argue this configuration arrangement may not be sufficiently simple.

I believe that is in line with @christian’s comment in Oct. 2022. @gbiggs’s response about having an implicit topology that allows for certain routes of communication sounds nice, but IMO, users should be much more explicit about their topology if they want introspection along certain directions, vs. having an isolation flag but saying “Hey, it’s not really isolated”. Ideally, being explicit does not have to mean being uber verbose with a narrow / fragile set of flags.

A crazier alternative would be to have a single ROS_DISCOVERY_INTERFACES variable.

I wouldn’t mind something like this, especially if it helps to collapse the number of moving pieces!

My vote is to make it as explicit as possible, e.g. if the variable is present but represents an empty list, it should be taken as that. No extra _OFF flags, nor gymnastics to say “it’s defined, but empty, so assume undefined, so use default”.
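As a rough sketch of the semantics I have in mind (ROS_DISCOVERY_INTERFACES is only the hypothetical variable floated in this thread, and the ";" separator is an assumption taken from the example above):

```python
import os
from typing import Mapping, Optional

VAR = "ROS_DISCOVERY_INTERFACES"  # hypothetical variable, not an existing ROS 2 setting


def discovery_interfaces(env: Mapping[str, str] = os.environ) -> Optional[list[str]]:
    """Read the interface list with explicit semantics.

    - Variable unset         -> None (fall back to the default behavior).
    - Variable set but empty -> [] (discover on no interfaces at all).
    - Otherwise              -> the listed interfaces, in order.
    """
    raw = env.get(VAR)
    if raw is None:
        return None
    # Split on ';' and drop empty items, so a trailing ';' is harmless.
    return [item.strip() for item in raw.split(";") if item.strip()]
```

The point is that "unset" and "empty" are distinct, observable states, so no separate _OFF flag is needed to turn discovery off.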

Users could always achieve this with SROS.

SROS 2 seems excellent (and possibly necessary?) for production deployment / testing!

However, expecting users who are prototyping in a research / development environment to learn about, configure, and deploy SROS, and pay the runtime overhead (as well as maintenance as the feature evolves) would not line up with a wish for minimal config for “maximum correctness” (failure modes that are easy to reason about).

Re: disabling static peers for FastDDS, sweet! Glad to know it’s possible!
But ideally, it would be nice to flat out restrict the traffic, discovery and otherwise.

All of this being said, using ROS 2 and its DDS implementations has been great! There are ofc roadbumps, and we all want it to be “greater” while also being “simpler”, but per the aim of the work Geoff outlined and Grey, Arjo, and others are working towards, I think we’re close to making that happen!

Don’t forget the educational scenario, too. It’s not viable to ask students or teachers to configure SROS for a class robot and class computers.


After spending more time thinking and working on this problem, I’d like to propose something that combines my previously mentioned concerns into one solution:

I suggest changing the requirement chart to look like this:

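Same host (rows: Node A setting, columns: Node B setting)

| Node A \ Node B | No static peer, Off | No static peer, Localhost | No static peer, Subnet | With static peer, Off | With static peer, Localhost | With static peer, Subnet |
| --- | --- | --- | --- | --- | --- | --- |
| No static peer, Off | :x: | :x: | :x: | :x: | :x: | :x: |
| No static peer, Localhost | :x: | :white_check_mark: | :white_check_mark: | :x: | :white_check_mark: | :white_check_mark: |
| No static peer, Subnet | :x: | :white_check_mark: | :white_check_mark: | :x: | :white_check_mark: | :white_check_mark: |
| With static peer, Off | :x: | :x: | :x: | :x: | :x: | :x: |
| With static peer, Localhost | :x: | :white_check_mark: | :white_check_mark: | :x: | :white_check_mark: | :white_check_mark: |
| With static peer, Subnet | :x: | :white_check_mark: | :white_check_mark: | :x: | :white_check_mark: | :white_check_mark: |

Different hosts (rows: Node A setting, columns: Node B setting)

| Node A \ Node B | No static peer, Off | No static peer, Localhost | No static peer, Subnet | With static peer, Off | With static peer, Localhost | With static peer, Subnet |
| --- | --- | --- | --- | --- | --- | --- |
| No static peer, Off | :x: | :x: | :x: | :x: | :x: | :x: |
| No static peer, Localhost | :x: | :x: | :x: | :x: | :white_check_mark: | :white_check_mark: |
| No static peer, Subnet | :x: | :x: | :white_check_mark: | :x: | :white_check_mark: | :white_check_mark: |
| With static peer, Off | :x: | :x: | :x: | :x: | :x: | :x: |
| With static peer, Localhost | :x: | :white_check_mark: | :white_check_mark: | :x: | :white_check_mark: | :white_check_mark: |
| With static peer, Subnet | :x: | :white_check_mark: | :white_check_mark: | :x: | :white_check_mark: | :white_check_mark: |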
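For anyone who prefers code to tables, the chart above can be read as a small predicate. This is only a sketch of my reading of it: the names are made up, and "static" here is assumed to mean that the node lists the other node's host as a static peer.

```python
from enum import Enum


class Range(Enum):
    OFF = "OFF"
    LOCALHOST = "LOCALHOST"
    SUBNET = "SUBNET"


def can_discover(a_range: Range, a_static: bool,
                 b_range: Range, b_static: bool,
                 same_host: bool) -> bool:
    """Return True if nodes A and B should discover each other per the chart."""
    # OFF always wins, regardless of host or static peer settings.
    if Range.OFF in (a_range, b_range):
        return False
    # On the same host, any two non-OFF nodes discover each other.
    if same_host:
        return True
    # Across hosts, a static peer on either side is enough.
    if a_static or b_static:
        return True
    # Otherwise both sides need SUBNET (multicast) discovery.
    return a_range is Range.SUBNET and b_range is Range.SUBNET
```

Note that the predicate is symmetric in A and B, which is exactly the property the earlier chart did not have.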

In summary: Nodes with the OFF discovery range will simply not discover endpoints in other processes, no matter where those other endpoints are hosted, what the other endpoint’s configuration is, or what the static peer settings are. I have two motivations for making this suggestion:

1. Use Cases

The only use case I can think of for completely turning off automatic discovery is to run isolated unit tests within single processes. It seems very unlikely to me that anyone using the OFF setting would actually want an outside process, whether on localhost or a remote host, to discover their isolated process by simply including its host as a static peer.

I can imagine a possible use case where test nodes are running in the OFF setting and a user wants a test-observer node to be able to tap into specific individual test-runner nodes, but we’re not offering that level of granularity with the two environment variables.

Given only these two parameters of ranges and static peers, I think the most likely desired outcome is that a node with OFF does not want to be discovered at all.

2. DDS Discovery Protocol

What I know of the DDS standard is that it is a discovery-hungry protocol. It wants to discover participants, and you need to put up barriers to block that when you don’t want it to happen.

I don’t think there are standard DDS mechanisms for achieving the original requirements matrix. Instead DDS implementations would need to provide a mechanism for RMW to arbitrarily reject connections based on participant info. While this feature can certainly be implemented upstream by the DDS vendors, it might not be reasonable to demand it, especially if we’d like the feature to arrive in Iron Irwini.

Instead, the table I’ve proposed can be achieved without injecting any custom logic into the DDS discovery protocol. We can get it by toggling standard aspects of the protocol:

  • SUBNET: Usual settings, all DDS discovery mechanisms active.
  • LOCALHOST: Turn off all multicast discovery. Turn on unicast discovery for the loopback address plus unicast discovery for any static peers.
  • OFF: Turn off multicast discovery. Give the participant a unique domain tag based on its process ID. Ignore static peers.

These are all standard DDS mechanisms, so I expect (almost) all DDS vendors to be capable of supporting them.
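To illustrate, the mapping from the three ranges to protocol toggles could look roughly like the sketch below. Everything here (the DiscoverySettings type, the field names, the domain tag format) is hypothetical and not an RMW or vendor API. Passing the static peers through as extra unicast peers in the SUBNET case is also my assumption.

```python
import os
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DiscoverySettings:
    """Hypothetical, vendor-neutral view of the discovery toggles."""
    multicast: bool
    unicast_peers: list[str] = field(default_factory=list)
    domain_tag: str = ""


def settings_for(discovery_range: str, static_peers: list[str],
                 pid: Optional[int] = None) -> DiscoverySettings:
    """Translate a discovery range plus static peers into protocol toggles."""
    if discovery_range == "SUBNET":
        # Usual settings; static peers are contacted in addition to multicast.
        return DiscoverySettings(multicast=True, unicast_peers=list(static_peers))
    if discovery_range == "LOCALHOST":
        # No multicast; unicast to loopback plus any static peers.
        return DiscoverySettings(multicast=False,
                                 unicast_peers=["127.0.0.1", *static_peers])
    if discovery_range == "OFF":
        # No multicast, static peers ignored, and a per-process domain tag
        # so that no other process can match this participant.
        process_id = os.getpid() if pid is None else pid
        return DiscoverySettings(multicast=False, unicast_peers=[],
                                 domain_tag=f"ros_discovery_off_{process_id}")
    raise ValueError(f"unknown discovery range: {discovery_range!r}")
```

For example, settings_for("OFF", ["10.0.0.2"]) drops the static peer entirely, which is what makes OFF override everything else.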

I don’t generally think that ROS 2 should be constrained by what’s available in the DDSI specification, but combined with the use case concerns and the timeline that we want for these features, I think revising the requirements like this would make sense.


Just found this thread, some great info. It seems people are asking for use cases, so here’s a relatively complex one: multiple vehicles and multiple users, with both local and networked comms.

Key requirements:

  • There is more than one transport for DDS
  • Drones can communicate over mesh or client/AP, whatever is available. It’s dynamic during operation.
  • Each drone has a set of “private” topics at high rates that NEVER need to go over network. These are internal interfaces in each drone.
  • A realistic number of private topics on each drone may be on the order of 30-40, so private discovery traffic shall NOT clog the network.
  • Each drone has a set of “public” topics that go at low rates to other drones or users. Each drone knows where the other drones are at 1 Hz as long as it has either a mesh or an AP connection to the other drone. Users only get public topics too.
  • Different users have different permissions. User A has ability to send data out, while B is read only, perhaps mandated by SROS access control.
  • User commands may be directed to a single drone (“drone1, go to waypoint 1”) or to all drones (“all drones, come land”).

I don’t understand how one would use environment variables in Drone1 to launch the nodes. You need to configure the GPS, IMU, and camera nodes to be localhost only, but the output of fusion is not. If it’s a single launch file for the drone, then you only get one environment to launch.

If you need to modify the diagram, I drew it in draw.io. You can import the following XML:

<?xml version="1.0" encoding="UTF-8"?>
<mxfile host="app.diagrams.net" modified="2023-02-28T16:30:06.991Z" agent="5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36" etag="FTuerD_ndhn0wYnKQ5cS" version="20.8.23" type="google"><diagram name="Page-1" id="plCf7v0-zyhF3JTSrZRu">7V3fc6o4FP5rfLwOAYLwqHa7d2d6Z+5O587ufdqhGpW9aFzA1vav3yA/hBwQpEBSal8qUSKefPk453wnYaTNt8ffPXu/+UaXxB2pyvI40u5Gqop0TWH/wpbXqMVUUNSw9pxl/KFzw6PzRuLG+Lz1wVkSP/fBgFI3cPb5xgXd7cgiyLXZnkdf8h9bUTf/rXt7TUDD48J2YetfzjLYxL8CK+f2r8RZb5JvRkr8ztZOPhw3+Bt7SV8yTdpvI23uURpEr7bHOXFD4yV2ic67L3k3vTCP7II6J+w1Y3s0396+rp7Jzz+n/07cp7+/xL082+4h/sEj1XBZf7Ol8xxedPAaW8L47xBe6Sz3ah3/P53h7+1d4SlP9uLX2qOH3fLLgrrUG2nT8Lp2TuDYbmGXdx7dEZR0zH5T1Hf++1jz6SKTVjX35erLxgnI495ehMcvDJ3sQ5tgy77wDrGXtr+P8LJyjoSZaLZyXHceXR47XVutVupiEf6swKO/SOadpfFkYCPswnXWO9a2YCNA2JuzZ+IFDsPONH4joPv0ysL3yLF07FCKCDaVCN2SwHtlH4lPSDAUT6JJfPhyRiQy4rZNFo1Jox3PgnXa8xko7EWMlStwgwFutsTfIDAIkfGSmaPmh4DNiH34ue1xHZLHeOXSl8XG9oLxlnhr8g/1/vED6p1m6KxoOLkhm8/v7y2rE4urGjT5pMDiuCuDG8DgC9dhv0WwyQ32d3/fjskRkszmZiHI1QGB3JDM4lYJygXbvE2Ua1gymyPoA4Qw14YDc92SzeRqCc4FG71NnGNTNqNrwOhS+Lo/fOJNb65u6gUoedxgRbizi6DzJQ1yZjfkpDSvY+mQMwHIsfetuezkGHj2Ighpfkts/+CR7clUvTvu3P1Vx6KpHnru0f0VToAPbnqsYclMD1340JucymD4mXZ/bxjd3KaEGz5JWHCGlwLxbRqeZ3nxhi/yKzmbk+WaPMaHO7oL3XHqBRu6pjvbfaDhHfBk/n9JELzGSXD7END84Jycg/A2fKdkhsp1dr/YITOf9/p3+N4YJ4c/4xNPB3fH+MTo6DV79J14DrNGeH8+NUY/ILzqy6OkKoHNAo3ggnnM4tHMjFYSKHjEtQPnOf+VRcMV9/adOuxizjNSL5mRSRc+PXgLEp91HnTQUQqfuCONdw+i3ww6mnqe/Zr52D78gH/hgjXue3Augc9eRD2eoZna7x1o1WVDq44nWbwqY8uwmmCWHJ0gg3929DPzzrmn8OA1c9Ac+xGgatwMKycJ7muS8NBuOkcQrujo6jmS4+A2gA4lihst81m/j8XLek+8XPI93fJyUY5BHrgqguGaUcy7hqvWElx1Ts7S1Y7gykmV2qQPuMLERhhtQMi6rrP3SQZoC5celiVhQwalXUYQnL0QDCBSm2YjCLOzCAKmK3qZ+2ePSWdBTMZnGita4kPV95uyLp2pcy6d3i1/VHtik5pEY/TFMxqquJvV5RnNavu22LorBnNC/eA7G2RwgLQmgu9oVl9A40tZmt/Q+JSXIhvQklu3OCJ9X+gpKFhojz0jYbsX+uQDUP7uXBvVfBaG70g8qouKNPulT6liAdRbSlFtKxbAHHWm9+yWYwHgC1g9xAKaCvDJhlsJ1WmA0yQKOGzd6SKgWT34wX4i7nfqO4FDQ134iQYB3ZYKxhnY0kPAwlcyT0vAlW4DCM3ihOaCelytIIDoTGbWoAQxHfQIc
HWLaa2RsAGAycYffwDTn+kVXR0ATw1TmZktaZdcBGzo0HyFETCfwWjPfjD7JbH99Dr2M3Cf9oMZhPmDvAbkxfMiA+q94g9GqN+YPdq0X1jO2lZBK6iRM6D9zD7tp0MXABiP7JbTcEHWKMwB2r7vLOqFVGr9mGrUYsgT31IrvVGtOpDHBUOB2/FQEVdli7WGHipwdWuGQW05kTp0YZgnQlgL+voGscSG+OSt5CFUWWu4dZbLsI+ZR3znzX469RciJ3asWed4NsJ3l6ZpvNIwPnmUTqosyi5MktJJ/YVBXU2IPB6FZH3eO1GSdJMsH8ufT1crn3QzqHWKDRrSwuSKVMs5kJ1YRo+h7GBYhku2YNSQZRDW8x1NemYZ6KcPjmX0T8cydaTzhiyDjawwhurSDEK5miZmdMXUmnBNxwVKid9bzT/VFUo3/qkBVSibD45/ShTUAfNPHfm+Vy/nKvGoLp2Imvx8vUaaLHhviIP7DnFgkiHZ50L5y37dp9/78TmgRKHpngMShzrp1eiLBBKwd+KEmFx1TuRd1It3cuHOWFGMCjoQ4YZgtaYbEg2vMDdkYoaEnCOjxgXUGhqrnIxUUxRsDbJQVx4oH+GS6tHu+ShJjKa99uaVJHMqm9J2GOfQnQ8GVZq8Np/NFJ3XxjArKbMuwJcZiLcfjHdauyt2oAuIurFgbI11TUn/8oEM5uvsaledXO62Zp1oa4QEw6Shxb4R3D9T7IvrlPbeJjjGoetdOhOb1i5e7hYEyh1PcAOGQIOb4CUyzHAnuAGDhN+/P4LRlKWyhKt1r1kX1pn/Y0An/I9vP6S1Hrc6GBVsoNOv+aD7Pbe3xLOltSC/erAAgL0WNhlQgb8/hEFgqxZsM4bhZjAq2N6j1xDGELSOvHhNSW0FcnJVyf6opv9UI5dn1C1pMEqE6vbL+Llq4aYlUmBduGwLRYyCTYLtLWvQlM1AHDCjRIeOu1fGaGIarYCGVxLMfAcdelx1cibyEY4qjHDqrhvqj3A4P6qpYMmXPdSM53okHJjS2dnPvh2snCNrxoNhncq8DtJRvrKkpagvrxClezX3wEKi1my/j4WEkZAqHQlx++ShpjVTmOtHuu0FErbNQNXZHlgDi18Hw0CViaeuGMjKM5DWFwFN6qx6vmWWLWuM1dIUMOLrG+tO+svd1g19WsMCTKMNLbM8uZwVGWBm2SyKcyTYVf327CkuIcelT8Q/fMqEYYfUmgRnQNGihAlde5lFCX6jLOGqhAX9zQdnaXvyWnBSCcBeRQkLunayixLcFC5SJVJc9mND6BIVOkKSxefCZIlk0lbvy1btzre1hRW/KVBDX10zL/cjPEC3oIzrRoTJQvShROjRhBy0MmEJ2jv+oyoTiZ8qE+fwmxI35Rz+kT/ScQ6U7QcpTViVy70HJ01YgrZK/6DSRBLuScRCqtmSNqFasosTFsxyDU6csCqXfA9NnLA6XPI9IHFC00IZIT/wTRWJgr76liEsmDgbmgxhCVu5LUqGSNM2Nx1Cbh2Cf4q3eB0CJbLoBxEieAuKFiKQogL7yaxEYK4AWLgSgRSYWXuwi3Z9lsWEhlIJwV6lCKR8uAUS/CwWr0Ug5bZG4qqQPJ231RueKL3VC2L+QRgNHXWDf7CRbCE5UmACyY1YU9GGEpPHc3LQegRSbkslruSdumsl+uQd/vmgTXmHf86VfLzzSZZLxPPyU4kSSLktmLiOilQJqQi1tWZClV2XSH/akIWJeFJ+JmUCIZiTgoN5kyaw0Z40UdBX39IEQjCVNjRtApU9s37A4gSCCb7WZvPH3lZWFHHwu0CoydONr98BNn+PAPTTOWXA1OdA939FZfsGd88c4jaARQhmZeMdYMGoypLb5h9s1uHuP+zQo6HgerY5+6Wbb3QZYvu3/wE=</diagram></mxfile>

That sounds very reasonable to me!

I was wondering about the previous table too, and could not think of actual use cases for those entries. Having OFF prevail over any other discovery setting does make sense to me.

You’re right that this kind of topology is not supported by these environment variables. I doubt environment variables will be enough for such a complex topology. The work we are trying to do here is primarily incremental. Our main use cases are:
(1) Where there are a lot of ROS systems on the same network, there should not be any public communication taking place.
(2) There should be an easy way for an external user to connect to a robot running ROS 2 and inspect its topics.
(3) There should be a way to run automatic discovery, but only if the user asks for it.