Bad networks dragging down localhost communication?!

Today I had first-hand experience of a communication problem that may not have much visibility yet, but which could be very serious: if confirmed, it means that a bad network link can drag down all communication, not just the traffic actually going over that link. I recently heard about this from a colleague, and now I have been able to verify it.

The situation was like this: a robot whose entire functional software runs on one machine, so that all communication necessary for the actual function of the robot is local (but not intra-process). Add to this rviz2 and some rqt tools running on an external laptop. This is probably a pretty common development setup.

Now, whenever the wifi coverage became spotty, the robot started to behave erratically! I'm not even talking about serious loss: ping times grew considerably (up to 1 s), but I saw at most 10% packet loss between the robot and the developer laptop. Yet the robot experienced major delays in message delivery, and the rates on many topics dropped sharply on the robot itself!

I would have expected rviz2 to update with some delay, but I certainly did not expect the data rates on the robot itself to be affected.

Now, I would have questioned my setup or my sanity at this point, but for the fact that just a few days ago, a colleague told me about a similar issue. He guessed that it had to do with some traffic shaping in the DDS layer, but neither of us knew enough about that to be sure.

Therefore I wanted to raise this issue here, to get feedback from others on whether they have also seen this, and maybe from the DDS vendors (I've used FastDDS; my colleague also tried CycloneDDS) on what could be done about it. I'm using ROS2 Foxy.

Further to that, about two months ago when I was trying to work on a train, I noticed that I couldn't get ROS2 to run at all if I wasn't connected to a network (!!). Without a wifi or LAN connection, it simply failed to run. I'm sure there's a setting in the DDS configs for that, but that should be handled out of the box.

Your issue reminded me of ros2 on embedded board without any network connection on ROS Answers.

Not sure whether that’s still relevant, as the Q&A is from 2019, but still.

I’ve seen this issue myself with RTI Connext DDS, although it was many years ago. On Linux if only the loopback interface was available, DDS just plain wouldn’t work. I have no idea if it’s been fixed now.

I have also seen that in the past, but that seems to be a discovery problem, and I think @gavanderhoorn's reference should help with it, though I haven't tried it.

In the situation I’m describing, however, discovery has already taken place and the system is running, but when I fire up rviz2 over a slow link, the throughput, delay, and rates of the other connections get worse as well. Once I close rviz2, everything goes back to normal.

We have seen this a couple of times with ROS 1 boxes that were running completely local ROS-wise but would see 10 s or more of delay on service calls when a low-quality wireless dongle was attached for remote support (TeamViewer). I was really hoping ROS 2 would solve this kind of stuff :unamused:

I’m curious if the same phenomenon can be observed:

  1. With a pure DDS application: have a simple DDS application running locally, exchanging data between two participants in separate processes, then fire up a DDS reader over the slow link and see if the local application’s data rate goes down.
  2. With a pure UDP or TCP application. Do the same as the above, but using raw sockets.

If you have time to try those two cases, it would help isolate the cause to the networking stack (seems unlikely to me), DDS, or ROS 2.

Depending on the DDS in use, the local-machine communication may be going via shared memory. I don’t know why adding a slow network connection would slow that down, but maybe it’s relevant? Perhaps having to buffer the network connection’s copy of the data slows down the sending thread, which in turn slows down how fast it can write into shared memory? I’m totally speculating now. :stuck_out_tongue:

Someone else reported exactly this on ROS Answers in September 2020: Bad performance of ROS2 via Wifi - ROS Answers: Open Source Q&A Forum.

Not trying to hijack the ROS2 issue, but similar problems on ROS1 might indicate a direction to search in.

Whenever my internet connection goes out, local ROS1 performance becomes poor. Even things like rostest (which shouldn’t really communicate with the outside at all) fail locally.

Having no internet connection will make DNS fail in most cases. Since ROS1 relies on DNS unless told not to, that case is easy to explain and address (use IPs or a local DNS).
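For example (the addresses and hostname here are purely illustrative), pinning ROS 1 to numeric addresses sidesteps the DNS lookup entirely:

# Use IPs instead of hostnames for the master and our own advertised address.
export ROS_MASTER_URI=http://127.0.0.1:11311
export ROS_IP=127.0.0.1
# ...or keep hostnames but resolve them locally via /etc/hosts:
#   127.0.1.1   myrobot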

The slowdown issue observed now is different and, I suspect, DDS specific.

I guess I have to follow @gbiggs’ suggestion to debug this.

We have lots of robots in our DARPA SubT Challenge team running ROS 1 which get completely isolated for some time (when they get deep in the tunnels without any buddies around), and we never had a problem with delays for local nodes.

I think I have some clues regarding this, at least for Fast DDS, but I suppose the same would be true for other vendors.

When the remote rviz2 subscriptions are discovered, RTPS datagrams will start being sent to the remote address. If the interface link is intermittent, they will queue up in the socket buffer, and the calls to the OS send_to function will most likely start blocking.

This behavior could be checked with a pure UDP application, as @gbiggs suggested, taking care that the application simulates what the DDS vendor is doing: using a single socket and performing send_to calls to different destinations. A runnable sketch in Python (the addresses are illustrative and must be adapted to your setup):

import select, socket, sys, time

# One UDP socket sends the same payload to a local reader and, when
# toggled by pressing Enter, to a remote reader behind the slow link.
local_address = ("127.0.0.1", 7400)       # assumed local reader
remote_address = ("192.168.0.42", 7400)   # assumed remote reader
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_to_remote = False
try:
    while True:
        if select.select([sys.stdin], [], [], 0)[0]:  # Enter pressed?
            sys.stdin.readline()
            send_to_remote = not send_to_remote
        data = b"x" * 1400                # roughly one MTU-sized datagram
        s.sendto(data, local_address)
        if send_to_remote:
            s.sendto(data, remote_address)
        time.sleep(0.001)                 # ~1000 datagrams per second
except KeyboardInterrupt:
    pass
finally:
    s.close()

Things that could be tried (either alone or combined):

  • Increasing the socket buffer.
  • Setting the UDP transport as non-blocking.
  • Using a bridge application, so that remote readers don’t attach to the local writers directly.

Thanks for the suggestions.

I’m thinking that @Ingo_Lutkebohle’s case is relatively easy to reproduce by subscribing to a high-resolution image topic: since it requires high bandwidth, it will show the slowdown more quickly than something smaller like /cmd_vel. I have such a setup, with an image topic published by a node that also logs the frame rate of its loop every 10 seconds.

  • Increasing the socket buffer:
    This addresses the case of (brief) intermittent high latency / low throughput, right? As far as I understand, it mainly delays the moment that send_to blocks on a full socket buffer (the XML sketch after this list shows one way to configure it).
    If it had any effect at all, this suggestion only delayed the issue.
  • Setting the UDP transport as non-blocking:
    This looked like the most promising change to me. However, configuring it through an XML file and the FASTRTPS_DEFAULT_PROFILES_FILE environment variable (again, see the sketch after this list) didn’t solve the issue. I also tried setting RMW_FASTRTPS_PUBLICATION_MODE to SYNCHRONOUS, but that didn’t solve it either.
  • Use a bridge application:
    This works even without different domains. Having another node B on the local system subscribe to the original image topic /camera_a, republish the same messages on a topic /camera_a_mirror, and then visualizing /camera_a_mirror in rviz prevents the slowdown on /camera_a itself.
    I feel that this use case is quite common for anyone developing mobile robots. If a bridge application is required to make things run predictably, maybe there should be a generic topic-mirroring node? (A minimal sketch follows below the XML example.)
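For concreteness, here is roughly the kind of Fast DDS profiles file the first two bullets refer to. Treat it as a sketch rather than a verified fix: the buffer sizes are arbitrary illustrations, whether non_blocking_send is honored for UDP depends on the Fast DDS version, and the OS additionally caps socket buffers via net.core.rmem_max / net.core.wmem_max.

<?xml version="1.0" encoding="UTF-8"?>
<profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
    <transport_descriptors>
        <!-- Custom UDP transport: larger socket buffers, non-blocking sends -->
        <transport_descriptor>
            <transport_id>udp_tuned</transport_id>
            <type>UDPv4</type>
            <sendBufferSize>4194304</sendBufferSize>
            <receiveBufferSize>4194304</receiveBufferSize>
            <non_blocking_send>true</non_blocking_send>
        </transport_descriptor>
    </transport_descriptors>
    <participant profile_name="udp_tuned_participant" is_default_profile="true">
        <rtps>
            <!-- Replace the builtin transports with the tuned one -->
            <useBuiltinTransports>false</useBuiltinTransports>
            <userTransports>
                <transport_id>udp_tuned</transport_id>
            </userTransports>
        </rtps>
    </participant>
</profiles>

Point FASTRTPS_DEFAULT_PROFILES_FILE at this file before starting the nodes so the profile is picked up.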
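And for the third bullet, here is a minimal sketch of such a mirror node in rclpy. The topic names and message type are the ones from my image setup above; a truly generic version would need to handle arbitrary types and QoS.

import rclpy
from rclpy.node import Node
from sensor_msgs.msg import Image

class TopicMirror(Node):
    # Subscribes to /camera_a and republishes every message unchanged on
    # /camera_a_mirror, so remote tools never attach to the original topic.
    def __init__(self):
        super().__init__('topic_mirror')
        self.pub = self.create_publisher(Image, '/camera_a_mirror', 1)
        self.sub = self.create_subscription(
            Image, '/camera_a', self.pub.publish, 1)

def main():
    rclpy.init()
    rclpy.spin(TopicMirror())
    rclpy.shutdown()

if __name__ == '__main__':
    main()

rviz2 on the laptop then subscribes to /camera_a_mirror, and only the mirror’s own publisher is throttled when the link degrades.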