Unconfigured DDS considered harmful to Networks

Internally we have documentation telling people to configure their machines with these settings:

  1. Save this file to ~/.ros/cyclonedds.xml
<?xml version="1.0" encoding="UTF-8" ?>
<CycloneDDS xmlns="https://cdds.io/config" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://cdds.io/config https://raw.githubusercontent.com/eclipse-cyclonedds/cyclonedds/master/etc/cyclonedds.xsd">
  <Domain id="any">
      <General>
          <!-- Bind only to the loopback interface -->
          <NetworkInterfaceAddress>lo</NetworkInterfaceAddress>
          <!-- Keep multicast discovery traffic off the LAN -->
          <AllowMulticast>false</AllowMulticast>
      </General>
      <Discovery>
          <!-- Use unicast discovery against peers on localhost only -->
          <ParticipantIndex>auto</ParticipantIndex>
          <Peers>
              <Peer Address="localhost"/>
          </Peers>
          <!-- Raise the limit on automatically assigned participant indices so many nodes can run on one host -->
          <MaxAutoParticipantIndex>120</MaxAutoParticipantIndex>
      </Discovery>
  </Domain>
</CycloneDDS>
  2. Set these environment variables:
RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
CYCLONEDDS_URI=~/.ros/cyclonedds.xml
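For example, one way to set these persistently is to export them from your shell startup file (the use of ~/.bashrc here is just an illustration; any mechanism that sets the variables before ROS 2 starts will do):

# ~/.bashrc (or any file sourced before running ROS 2)
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
export CYCLONEDDS_URI=~/.ros/cyclonedds.xml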

The reason for this is that we keep regularly bringing down our network by accidentally running ROS 2 on computers internally without these configs. Practically, what this looks like is a bunch of multicast packets flooding our wired and wireless networks until they grind to a halt.

The reason I’m making this post, instead of opening a PR somewhere to try to fix this, is that I am unsure where to make such a PR or how to make this the default behavior of ROS 2. I doubt the DDS vendors want something like this as a default, and ROS 2 has no infrastructure for setting environment variables or installing files such as DDS configs into users’ home directories during installation.

I could open a PR to the ros2 documentation to add a warning about this, but most users will still discover this the same way we discovered it.

Here is one of many memes we’ve made to express our frustration at this default.

18 Likes

For the most part, these configuration options have equivalent APIs that can be used to set them. So if we wanted to make these changes the default, we likely could do it in the appropriate RMW implementations by adding in calls to the underlying DDS implementation.

Before doing that, though, I’d definitely want feedback from @eboasson (for a CycloneDDS change), @MiguelCompany (for a Fast-DDS change), and @asorbini_rti (for a Connext change). I think we should be as consistent as possible between the DDS providers, so if we did it in one of them I’d think we’d want to do the same general thing on all.

4 Likes

This is a good point that this shouldn’t be done in a DDS vendor-specific way. You would want the same consistent (and hopefully good) default experience with whichever DDS vendor is configured.

3 Likes

It would be great to get an equivalent Fast DDS config posted in this thread if anyone has one handy.

2 Likes

Is this different from what ROS_LOCALHOST_ONLY does?

3 Likes

I don’t know… but if that setting effectively does this, could we make it the default?

The trade-off with making that the default behavior is that multi-machine communication will not work by default.

It was discussed; I don’t remember the details beyond that.

Also, I’ve never seen that setting documented anywhere.

3 Likes

Until running ROS 2 with networking enabled is no longer destructive, could we get the default changed temporarily?

You can follow the discussion that was had here: Restrict DDS middleware network traffic to localhost · Issue #798 · ros2/ros2 · GitHub

I don’t think we need to spam Discourse with this back and forth (this isn’t a chat client). Can you make an issue/PR and link to it in this thread?

It probably does need to be documented, but I think there was some issue with how this interacts with IPv6 and that may have stalled work on documenting it and/or making it the default behavior. It needs people who are affected by it (like yourself) to help push it along.

I’m happy to try to help improve documentation around this. I just don’t think it’ll be much help towards stopping the current destructive behavior from negatively impacting people who use both networks and ROS 2. With the current defaults, it only takes one person on a team forgetting or not reading the documentation to bring down the network.

I can also respond on GitHub about this in that issue. My issue isn’t, as others have stated, that you get cross-talk, but simply that by default we seem to get a DDoS attack against our network when people forget.

If discovery could be made non-destructive by default, that would also be an acceptable solution.

This sounds like there’s something specific to your network going on that should be debugged. One user with non-localhost-only behavior does not typically bring down the whole network. As others have mentioned, the more common issue is cross-talk, not swamping the network. Switching to localhost-only only avoids the problem temporarily, until you want to use multiple computers, which is one of the primary reasons for using a distributed pub-sub architecture. This is something that likely needs to be debugged in an answers.ros.org question or an issue, rather than by changing the default to disable networking for everyone.

1 Like

Just FYI, it does. export ROS_LOCALHOST_ONLY=1 disables the default built-in transport and uses the 127.0.0.1 network address. (LoanedMessage can be enabled if the RMW supports it.)
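For a quick check that traffic stays on loopback, something like this works (demo_nodes_cpp is just the standard ROS 2 demo package, used here for illustration):

# restrict this shell’s ROS 2 processes to the loopback interface
export ROS_LOCALHOST_ONLY=1
ros2 run demo_nodes_cpp talker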

It probably does need to be documented

3 Likes

How many networks has PickNik had to replace so far, seeing as they’re being destroyed and all? :wink:

Destructive was a bit of hyperbole. The network hardware does resume working after we find which machine is originating all the multicast traffic and unplug it. Our current best theory for why we are suffering from this is that unmanaged switches, or something about our wifi networks, might be amplifying the multicast traffic. None of us knows enough about networking to have a clear answer for why this is happening. I only posted this because I figured our bad experience was probably not unique. We’ll continue to try to understand this, because while in most cases we are not building multi-machine systems with ROS 2, it is something we do sometimes, and it would be good if we could enable that feature without making our networks unusable.

I use multi-computer ROS 2 on corporate networks with a complex network infrastructure. We have conflicts between developers and must set domain IDs to be different; however, a bunch of computers each running 10-30 nodes alongside another bunch of network users on the same subnet has caused zero issues so far.

I’d vote for improving the networking documentation around that environment variable rather than making a breaking change to the default to disable networking in ROS 2.

5 Likes

We’ve only recently started to look into incorporating ROS into our industrial automation applications, and we ran into a similarly “destructive” issue.

Our experiments have been with FastDDS on Foxy.
The particular scenario that we had was:

  • Windows PC with ROS publishers
    • Network interface with IP on our office network
  • Linux PC with ROS subscribers
    • One real network interface with an IP address on our office network
    • Various other network interfaces created by Docker/VMware with weird local IP addresses

Running this arrangement with default settings took down our office network and saturated our internet bandwidth. Networking is not my strong suit, but my understanding of what happened is:

  • Subscriber sees that the topic it wants is published
  • It sends ALL of its IP addresses to the publisher, as potential destinations
  • The publisher then attempts to publish to all of these addresses
  • The correct one gets through (so everything appears to work)
  • But the incorrect ones have no route (since they were for VMware etc. on the subscriber PC), and so the packets are sent to the router and out to the internet

In this case we were simulating data from four 64-layer lidars, hundreds of megabits, and it saturated our external bandwidth on and off for a few days until the problem was found (we all just thought the router was on the fritz or something).

As frustrating as it was to take down the whole office internet, the real concern for us here is that our clients typically operate on limited-bandwidth networks (100 Mbps or less for control equipment) in remote areas, in safety-critical applications, so we need to exercise caution in the way we transmit data. In the past we have always used TCP as part of our approach, and we hoped that the default UDP implementation in ROS/FastDDS would still be suitable, but obviously it wasn’t.

Here is the config (@nathanbrooks) we have been using during testing with FastDDS, but we would be much happier if there were a safer default, since if we accidentally forget to set the environment variable we can cause some serious trouble. (Yes, it will normally be done in a script; yes, we can put it in our bashrc so that if someone logs in to tweak things their terminal would be OK; but it just doesn’t sit comfortably.)

FastRTPS XML
<?xml version="1.0" encoding="UTF-8"?>
<profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
    <transport_descriptors>
        <transport_descriptor>
            <transport_id>udp</transport_id>
            <type>UDPv4</type>
            <interfaceWhiteList>
                <!-- loopback -->
                <address>127.0.0.1</address>
                <!-- the machine’s real interface on the office network; replace with your own -->
                <address>192.168.104.42</address>
            </interfaceWhiteList>
        </transport_descriptor>
    </transport_descriptors>
    <participant profile_name="participant_profile_ros2" is_default_profile="true">
        <rtps>
            <name>profile_for_ros2_node</name>
            <useBuiltinTransports>false</useBuiltinTransports>
            <userTransports>
                <transport_id>udp</transport_id>
            </userTransports>
        </rtps>
    </participant>
</profiles>
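In case it helps anyone trying this, here is a sketch of how such a profile is typically picked up (the file path is just an example; FASTRTPS_DEFAULT_PROFILES_FILE and RMW_IMPLEMENTATION are the standard variables, and the whitelist address above needs to be your own machine’s):

# select the Fast DDS RMW and point it at the whitelist profile
export RMW_IMPLEMENTATION=rmw_fastrtps_cpp
export FASTRTPS_DEFAULT_PROFILES_FILE=~/.ros/fastdds_whitelist.xml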

I get that it’s the middleware’s problem; perhaps a sensible default there would be for it to pick a single network interface (trying to be smart about it), so that it either works or doesn’t, rather than appearing to work while causing chaos in the background.

10 Likes

I have seen similar issues, not with every network, but primarily with a wireless mesh network with many wireless APs talking to each other, where ROS2/DDS traffic caused it to grind to a halt.

4 Likes

Quick addition: Technically, export ROS_LOCALHOST_ONLY=1 does not perform all the configuration steps specified in the XML of the original post. It misses the part about <MaxAutoParticipantIndex>120</MaxAutoParticipantIndex>.

Without this setting, one runs into this error as soon as one has more than a few nodes: ros2: Failed to find a free participant index for domain ...
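If you want ROS_LOCALHOST_ONLY plus that one extra setting, a minimal sketch would be to keep only the Discovery section in a small config file (the file name is arbitrary, and this assumes the RMW applies both the environment variable and the CYCLONEDDS_URI config together):

<?xml version="1.0" encoding="UTF-8" ?>
<CycloneDDS xmlns="https://cdds.io/config">
  <Domain id="any">
      <Discovery>
          <ParticipantIndex>auto</ParticipantIndex>
          <!-- allow many nodes on one host -->
          <MaxAutoParticipantIndex>120</MaxAutoParticipantIndex>
      </Discovery>
  </Domain>
</CycloneDDS>

and then set all three variables:

# localhost-only traffic plus the larger participant-index limit
export ROS_LOCALHOST_ONLY=1
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
export CYCLONEDDS_URI=~/.ros/participant_index.xml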

2 Likes

Great point. I wrote up a feature request that would address that here.