Restricting communication between robots

Doing this would require first deciding that we want to lock ROS 2 to only use DDS as its middleware.

The alternative would be to decide what concepts we want at the ROS 2 level (such as the ability to do hard and soft partitions of the graph) and then design the mapping of those to DDS concepts for provision by each RMW.

The former is less work, but the latter offers greater flexibility and the ability to use non-DDS middlewares. However, the latter would have to involve some level of compromise and feature reduction.

Agreed, but I think we reach a point where the direction we’re going indicates a distinctly DDS approach, especially if making that decision can make ROS2 more easily feature complete (as this topic would want) or faster (@joespeed thinks we can make the ROS2 layers much more streamlined that way).

A middle compromise would be to require the RMWs to be DDS, but also supply a shim RMW that a less-complete communication middleware could use: the middleware implements only the most basic things and the shim provides the rest. That way we push that overhead out of the ROS 2 stack for most users, and it is only seen by the users who have to use it. We do this in navigation a lot by having plugins load plugins; here it would be a non-DDS plugin that fits inside of a generic RMW container.

1 Like

I think you bring up a very good point. It would be ideal to do this in a way that all kinds of middlewares can benefit from, and it is definitely a very interesting discussion we should have.

There is still the issue of taking full advantage of the specific middleware implementation the application is using. All implementations have their own tweaks for different use cases (apart from the standard ones). Maybe we can think of a way to configure those from the application layer in a way that is common to all of them. We could, for example, have key-value pairs that are pushed down and interpreted by each RMW implementation. We could standardize some of the keys for the concepts we deem common, but leave each implementation to decide what else to add there. Of course, this approach would require documentation for each specific RMW, but I think advanced users and industrial deployments would greatly benefit from being able to configure every single aspect of the middlewares.

With that approach (which is probably one of many possibilities), we could configure middleware implementations at build time, or even at run time by loading some JSON file, which would make it very easy for users to test different configurations without rebuilding their application every time.
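As a rough sketch of what that could look like (this API is purely hypothetical; nothing like it exists in rmw today, and all key names are made up for illustration), the application would hand an opaque set of key-value pairs down to whichever RMW is loaded, with a handful of standardized keys and the rest left vendor-specific:

#include <iostream>
#include <map>
#include <string>

// Purely hypothetical sketch: opaque options the application pushes down to
// the active RMW. Keys under "ros." would be standardized and understood by
// every implementation; vendor-prefixed keys are interpreted (or ignored) by
// each RMW as it sees fit.
using MiddlewareOptions = std::map<std::string, std::string>;

MiddlewareOptions load_options()
{
  // In practice these could be parsed from a JSON/YAML file at run time,
  // so configurations can be swapped without rebuilding the application.
  return {
    {"ros.history.depth", "50"},
    {"vendor.rti.flow_controller", "fast_flow"},
    {"vendor.cyclonedds.socket_rcvbuf", "10485760"},
  };
}

int main()
{
  for (const auto & [key, value] : load_options()) {
    std::cout << key << " = " << value << "\n";
  }
  return 0;
}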

We’ve discussed a similar idea in a few recent design discussions, but not a shim as such.

What we’ve discussed is providing default implementations for some of the more specialised features that we want to use - content-based filtering being a particularly good example. An RMW can then either provide support for that feature using its middleware, load in the ROS-provided implementation, or provide that feature in some other way of its choosing.

This approach is more flexible than a monolithic shim RMW, but it requires designing some APIs.

I disagree; I think they’d be equally flexible, and this gets that code out of the ROS2 stack entirely for the DDS implementations that most users will use (at this point). You could still have the RMW shim implement the virtual void contentFilter(...) {...} or other features, and then the specific networking vendor could decide which of the specialized features to override and implement themselves. But for all other RMWs or middlewares, that code isn’t even in the stack anymore (reducing size on disk and improving execution speed).
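As a minimal sketch of that shape (the class and method names here are mine, not any existing rmw API): the shim ships a generic fallback, and a vendor RMW overrides only what its middleware supports natively, so the generic code never has to be linked into that vendor’s stack:

#include <string>

// Hypothetical shim layer, not an existing rmw interface.
class RmwShim
{
public:
  virtual ~RmwShim() = default;

  // Generic fallback: evaluate the filter on the ROS side, after delivery.
  virtual bool contentFilter(const std::string & expression, const void * message)
  {
    return evaluate_locally(expression, message);
  }

protected:
  // Placeholder for a ROS-provided, middleware-agnostic filter implementation.
  bool evaluate_locally(const std::string & /*expression*/, const void * /*message*/)
  {
    return true;
  }
};

// A full DDS-based RMW overrides the fallback and delegates to the
// middleware's native content-filtered topics instead.
class DdsVendorRmw : public RmwShim
{
public:
  bool contentFilter(const std::string & expression, const void * message) override
  {
    return native_filter(expression, message);  // vendor-specific, not shown
  }

private:
  bool native_filter(const std::string &, const void *) { return true; }
};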

But either would work. The point I was making is that I think you can still do that in this framework as well. I think there may even be some benefit to doing it this way in terms of library sizes and latency (but I’m unfamiliar with the specifics, so I could just as easily be wrong).

2 Likes

And if we implement an OMG DDS C++ API RMW then you can scrap all the different RMWs and have one implementation to maintain that works with cyclonedds, fast, rti, et al. It’s crazy that we have a different RMW for every DDS implementation when the point of DDS is to have a consistent API and interop.

  • implement RMW for OMG DDS C++ API

  • All REP 2000 Tier 1 DDS implementations support the OMG DDS C++ API: cyclonedds-cxx, fastdds, connext

  • The current approach to RMWs involves much effort developing, testing, debugging and supporting three functionally equivalent yet incompatible RMWs. When the ros2 API changes, all three Tier 1 RMWs must be updated and tested

  • Erik Boasson suggested that switching to the DDS C++ API could cost a few percent in performance, but thought it a reasonable trade for removing this burden from Open Robotics and the community

(apologies, I realize this is slightly off-topic but it has been bugging me for a year)

This question has been open for over 3 years. In that time the first LTS version of ROS2 has been released, and unfortunately there is still no standard, documented way of supporting multiple robots on the same network.

While companies that provide closed fleets may implement their own solution, for the sake of the community we should avoid this becoming the ROS1 multi-master problem that was never fixed.

We (PAL Robotics) work a lot with universities and research institutes, and we’d like the people who use our robots to be able to do so without worrying about conflicts with robots from other companies on the same network using ROS2. But this won’t work if each robot has to play by different rules.

DDS’s discovery mechanism, while powerful, can be extremely dangerous.

How can we push this forward? Is a working group needed? Are contributions needed anywhere?

4 Likes

Even a successful reference design from someone with a working ROS2 solution would be valuable, in particular one covering restricted topics, TF, and a system diagram.

I thought this was supported by setting a different ROS_DOMAIN_ID per robot/network:

The ROS_DOMAIN_ID is limited to 256 different values as far as I know, so it’s not possible to assign a unique id to each robot. Also, development machines need their own unique ids.

We would need some kind of DHCP server for distributing ids among the online robots, and so would our customers.

Yes, you cannot have more than 256 robots/networks in one subnet, but you can have as many machines within each “ROS2 network” as you want (up to the subnet size). If you don’t have more than 256 robots in your subnet, every robot can have a unique ID.

This property of UDP multicast traffic being sent to all participants in the same subnetwork is not unique to ROS2. There are many networking tools (subnet partitioning, VLANs, VPNs) to partition the multicast traffic.

I personally think that using separate VPNs per robot is the safest option, as you can also force users to authenticate, making sure that they do not accidentally send messages to the wrong ROS “master”.

1 Like

I am not sure the following helps; I just wanted to share our ideas. (Of course, it all depends on the use cases and actual environment, and is always a trade-off.)

  1. Physical Network Configuration (physical boundary; safe, but it costs)
  2. Software Defined Network (logical boundary; safe, but adds overhead)
  3. ROS_DOMAIN_ID, which I think is something we can use based on trust (not secured)
  4. SROS2, authentication and access control

ROS_DOMAIN_ID can be used to create districts, but as I mentioned, it is based on trust, so we would consider it for internal use only. (I think the same applies to namespaces and partitions.)

To support 3rd party nodes, authentication and access control from SROS2 could be useful. But I think there is a question of how we can manage the keys and certificates for each node in a distributed system. Being a distributed system, these keys and certificates should be attached to the specific hardware when the application nodes start running. That is kinda off topic, but it could be distributed storage (actually we’ve been considering Kubernetes ConfigMap).

1 Like

@v-lopez: isn’t part of the challenge that both isolation and poking holes in that isolation at opportune times are desired (i.e., when isolated robots actually do need to communicate)?

Yes, both are desirable. But I believe intentional inter-robot communication comes after safe single-robot communication in a shared network.

Even for developer machine isolation: how do companies with multiple developers on ROS2 work? OSRF uses DOMAIN_ID as far as I know; is that standard practice for other companies?

The fact that someone doing their first ROS2 tutorials could be affecting someone else’s simulation or physical robot is a radical and potentially dangerous change from ROS1.

Just yesterday I was going crazy seeing nodes alive after I had killed them while working from home. After some head scratching I found out I was seeing some colleague’s nodes on their home pc through the shared company VPN.

@christian and @tomoyafujita have provided excellent ideas, as well as @spiderkeys here.

Maybe a first step would be to gather all of them somewhere with examples, detailing their limitations and costs, as well as adding documentation that makes this behavior clear, or providing some default settings that disable it (I’m testing export ROS_LOCALHOST_ONLY=1 now).

5 Likes

Just to provide an update on how we are working around this issue at my company (which may not work for most users, as it is strictly limited to DDS RMWs): we are making slight modifications to the RMW implementations (rmw_cyclonedds_cpp and rmw_connextdds) to allow us to get a handle to the underlying Publisher/Subscriber entity and modify its partition policy dynamically, following the pattern I laid out in the github issue @v-lopez linked above.

On each robot, we have one ROS2 node acting as a Beacon publishing its UUID with a default wildcard partition, and all other nodes have their partition modified to reflect the robot’s UUID. On the client/peer side, this allows the flexibility to make an explicit ‘connection’ (discovery limit) to a specific robot, or use the wildcard partition to subscribe/publish to all robots (mostly useful for introspective/debugging purposes, or to subscribe to the Beacon topic to discover participants in the network).
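For reference, the DDS side of that partition change looks roughly like the ISO C++ PSM sketch below. Getting the underlying dds::pub::Publisher out of the RMW is the part that needed our vendor-specific modifications, so treat publisher and robot_uuid as given here:

#include <string>
#include <vector>

#include <dds/dds.hpp>  // ISO/OMG DDS C++ PSM header (e.g. cyclonedds-cxx)

// Sketch only: 'publisher' must be obtained through the modified RMW,
// which is exactly the non-standard part described above.
void move_to_robot_partition(dds::pub::Publisher & publisher,
                             const std::string & robot_uuid)
{
  // Replace the default wildcard partition with the robot's UUID partition,
  // so only readers in the same partition will match.
  dds::pub::qos::PublisherQos qos = publisher.qos();
  qos << dds::core::policy::Partition(std::vector<std::string>{robot_uuid});
  publisher.qos(qos);
}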

With regards to the use of DDS domains, we tend to use these to completely isolate logical domains, with bridging nodes at the boundaries of domains, which is a more common pattern in larger DDS environments (see: military/government systems).

I’ve put together another small visual to explain how we use them:

In this example, we have three separate machines with different network topologies, resulting in three separate communication goals. We use a combination of Domains, Partitions, and Transport to achieve each of these goals.

Goal 1. Isolated internal vehicle communication

  • Potentially significant reduction in discovery traffic/overhead
  • No external system should be able to directly communicate to these nodes
    • Achieved by:
      • Placing all isolated internal vehicle nodes in a unique Domain (ID 1 in the example)
      • Using the Participant’s TransportProperty::allowed_interfaces_list/denied_interfaces_list to limit which network interfaces can be used for each transport (in this case, localhost for UDPv4)
  • We should still be able to debug this system from an external device if necessary
    • Achieved via a few options:
      • Use a third network interface (virtual perhaps, or an Internet Sharing setup with ethernet/USB)
      • Rather than modify allowed_interfaces_list, you could explicitly control discovery/allowed_peers on the participants to only allow your specific debug machine to communicate.
      • Create a bridge in the Vehicle<>Client domain or another domain that exposes all topics you need
      • As @tomoyafujita mentioned, you can also generally utilize security/access control as a method to achieve this, rather than completely denying the shared network interface for the internal domain

Goal 2. Communication between ground control app and vehicles

  • Client applications should be able to communicate to any and all vehicles
    • This is the default case on the default wildcard partition
  • Client applications should be able to establish limited communications with specific vehicles or specific groups of vehicles
    • Achieved by placing Vehicle Bridge nodes in <VEHICLE_UUID> Partition and/or <GROUP_UUID> partitions and creating nodes in those partitions in the client app
  • Vehicles should not discover/communicate with each other in this domain
    • Achieved by limiting the allowed network interfaces for vehicle participants in this domain

Goal 3. Communication between vehicles

  • Specific vehicle nodes should be able to communicate with each other on a special V2V network interface
    • Achieved by placing these vehicle participants in a unique UUID and limiting their allowed interfaces to the V2V network
  • Neither internal vehicle nodes nor client applications should be discovered/communicate on this domain

This is, of course, one example of a mixed-domain topology. It has the benefits of strongly isolating and controlling discovery/overhead and avoiding difficult per-vehicle configuration issues. You can generally use the exact same network configuration/domain ID/code deployment for each vehicle, with the one unique identifier being some assigned or derived UUID (can use MAC addresses, CPU serial numbers, etc…). There are also some advantages to this approach with regards to DDS security, in that you can apply access controls at domain interface levels.
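As a side note on the derived identifier: a minimal, Linux-only sketch (assuming nothing more than a known interface name, here "eth0") that turns the NIC’s MAC address into the per-vehicle string used for partitions could be as simple as:

#include <fstream>
#include <iostream>
#include <string>

// Minimal sketch, assuming Linux and a fixed interface name ("eth0"):
// derive a stable per-vehicle identifier from the MAC address so every
// vehicle can run identical configuration and code.
std::string derive_vehicle_id(const std::string & interface = "eth0")
{
  std::ifstream mac_file("/sys/class/net/" + interface + "/address");
  std::string mac;
  std::getline(mac_file, mac);     // e.g. "aa:bb:cc:dd:ee:ff"
  for (auto & c : mac) {
    if (c == ':') { c = '-'; }     // friendlier as a partition/namespace string
  }
  return "vehicle-" + mac;
}

int main()
{
  std::cout << derive_vehicle_id() << std::endl;
  return 0;
}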

Happy to answer any questions, if this sort of thing is useful to others.

Note: Some of the DDS QoS policies relating to transport config that I’ve mentioned by name are vendor specific, but all of these facilities generally exist in some form across the implementations.

6 Likes

Minor remark: 232 is the maximum domain ID based on the default port numbering scheme in the DDSI specification.

The DDSI specification gained something called a “domain tag” not so very long ago, and that’s basically an extension of the domain id: it is an octet sequence that must match for participants to communicate with each other but it doesn’t affect the port numbers.

If you run all systems on domain id 0 but with a different domain tags for each system, the participant discovery messages will go everywhere, but those from different systems will be ignored. That means no discovery of the readers/writers, and that in turn means any data that happens to arrive is also dropped because the writer is unknown.

You can also combine them of course.

Minor remark: 232 is the maximum domain ID based on the default port numbering scheme in the DDSI specification.

Correct; for example, Cyclone DDS is happy to let you use any domain id in [0,2**32-2] as long as the port mapping yields valid port numbers. At the same time the domain tag was introduced, the domain id was also added to the discovery data. That means it is now safe to have different domains use the same port number. In its most extreme form, you can just set the “domain gain” to 0 and make all domains use the default port.

In other words, domain id and tag prevent cross-communication; different port numbers and different multicast addresses reduce unnecessary network traffic.

I know some of these can’t be set via ROS 2, but there’s no reason some of this can’t be added to it once there is consensus on the interface. Until then, at least with Cyclone DDS, you can set it using the CYCLONEDDS_URI environment variable, e.g.,

export CYCLONEDDS_URI="<Discovery>
  <Tag>Jantje zag eens pruimen hangen, O! als eieren zo groot</>
  <Ports>
    <Base>12312</>
    <DomainGain>0</></>
  <SPDPMulti>239.255.1.3</>
</>"

It might be interesting to experiment with this a bit.

(And yes, it’d probably be better to make a proper configuration file rather than using abbreviated XML in an environment variable … :slight_smile: )

I know this thread is now over a year old but has there been any progress made on this matter?

It seems to me that you could utilize the eprosima fast-dds discovery server to keep discovery messages within the robot.

By setting an environment variable that points to the discovery server, and then starting the server on the ROS2 robot, all that DDS traffic should stay within the robot because it gets directed to the discovery server.

I actually just installed Wireshark and I am in the process of verifying that the UDP data does stay within the robot. So I will update if this is the case (hopefully in a couple of hours).