Today I experienced first-hand a communication problem that may not have much visibility yet, but which could be very serious: if confirmed, it would mean that a bad network link can drag down all communication, not just the traffic going over that link. I recently heard about this from a colleague, and now I have been able to reproduce it myself.
The situation was this: a robot whose entire functional software runs on one machine, so that all communication necessary for the actual function of the robot is local (but not intraprocess). In addition, rviz2 and some other rqt_tools were running on an external laptop. This is probably a pretty common development setup.
Now, whenever the wifi coverage became spotty, the robot started to behave erratically! I’m not even talking about serious loss: ping times got quite a bit longer (up to 1 s), but I saw at most 10% packet loss between the robot and the developer laptop. Nevertheless, the robot experienced major delays in message delivery, and the rates on many topics dropped seriously on the robot itself!
I would have expected my rviz2 to be updated with a delay, but I certainly did not expect the data rates on the robot to be affected.
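For anyone who wants to put numbers on an effect like this, here is a minimal sketch, assuming you have recorded the arrival times (in seconds) of messages on one topic on the robot itself. The function name `rate_stats` and the sample timestamps are made up for illustration and are not measurements from my setup; collecting the timestamps (for example in a subscription callback) is not shown.

```python
from statistics import mean


def rate_stats(arrival_times):
    """Return (mean_rate_hz, max_gap_s) for arrival timestamps in seconds.

    The timestamps are assumed to be sorted in ascending order.
    """
    if len(arrival_times) < 2:
        raise ValueError("need at least two timestamps")
    # Time between consecutive messages.
    gaps = [later - earlier
            for earlier, later in zip(arrival_times, arrival_times[1:])]
    return 1.0 / mean(gaps), max(gaps)


if __name__ == "__main__":
    # Hypothetical data: a steady 10 Hz topic versus one with a 1 s stall.
    steady = [i * 0.1 for i in range(11)]
    stalled = [0.0, 0.1, 0.2, 1.2, 1.3, 1.4]
    for name, stamps in (("steady", steady), ("stalled", stalled)):
        rate, gap = rate_stats(stamps)
        print(f"{name}: mean rate {rate:.2f} Hz, longest gap {gap:.2f} s")
```

The longest gap is reported alongside the mean rate because a single stall can matter more to the robot than a slightly lower average. Comparing these two numbers with the laptop connected over good wifi, over bad wifi, and disconnected would show whether the remote link really affects purely local topics.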
Now, I would have questioned my setup or my sanity at this point, were it not for the fact that just a few days ago a colleague told me about a similar issue. He guessed that it has to do with some communication shaping going on in the DDS layer, but neither of us knew enough about that to say for sure.
Therefore I wanted to raise this issue here, to get some feedback from others on whether they have also seen this, and maybe get feedback from the DDS vendors (I’ve used FastDDS, my colleague also tried CycloneDDS) on what could be done about it. I’m using ROS2 Foxy.