ROS2 QoS Reliability Issue

ROS2 QoS Reliability Issue

tl;dr: using tools such as RViz over WiFi can cause a robot to stall, even when all topics are subscribed using the “best_effort” policy - seemingly it’s functionally equivalent to the “reliable” policy

Introduction

Mobile robots are very popular in the ROS community, and they necessitate wireless communication for real-time monitoring and debugging. Tools such as RViz are commonly used over a WiFi connection to visualize robot sensor data, path-planning routes, etc. Unlike fixed robots connected via a wired connection where the bandwidth and latency are fairly constant, the bandwidth and latency on a WiFi between the mobile robot and a developer’s workstation can vary considerably and often unpredictably as a function of the robot and the developer’s relative positions.

This shouldn’t be a issue, as the developer’s view of the robot’s own perspective on the world is supplemental to the robot’s operation - not critical. This is what the ROS2 topic reliability policies were designed for, a reliable subscription policy that can be used for mission-critical connections, possibly internal to the robot, and a “best_effort” subscription policy that can be used for non-critical monitoring and debugging.

The ROS2 documentation on QoS polices states that “ROS 2 can be as reliable as TCP or as best-effort as UDP”.

As will be demonstrated however, there is seemingly no functional difference between the two policies - both act as a reliable subscriber, meaning that the publisher will throttle its publishing rate if the subscriber can’t keep up, which is often the case for high-bandwidth topics over a wireless connection.

This means that for mobile robots, where engineers use tools such as RViz monitoring topics over a WiFi link, should that link degrade, the internal communication on the robot also degrades, even when using shared memory! As such, the behaviour of the robot as a whole under test, and in the field, could be drastically different.

For us this often causes our robot to grind to a halt if we’re trying to debug the robot’s navigation using RViz and we loose line of sight. This could be alleviated by only using ROS2 DDS for internal communication on the host, and a ROS bridge to external devices. But the the benefits of DDS are lost.

Better WiFi coverage and higher bandwidth APs could also alleviated the issue somewhat, but still offers no guarantees, and in principle should not have an effect at all - such non-determinism is a non-starter in many industrial applications.

Reproduction

The reproducible demo for convenience uses containers to simulate two remote hosts, but I’ve also reproduced this issue outside of containers between two real remote hosts, and that both the FastDDS and Cyclone Iceyorx shared-memory implementations are also affected by slow remote subscribers.

Prerequisites

Webcam at /dev/video0 and Podman installed.

Download and build

git clone https://github.com/ciandonovan/ros2_publish_rate.git
cd ros2_qos_reliability
podman build -t $(basename $PWD) .

Run

export RMW_IMPLEMENTATION=rmw_fastrtps_cpp or export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp

Simulate a constrained WiFi network on the loopback interface.

sudo tc qdisc add dev lo root tbf rate 10mbit burst 10mbit latency 50ms

N.B. use sudo iftop -i lo to monitor loopback traffic in real-time.

Run ROS2 camera publisher in one terminal.

podman run --rm -it --name ros2_publisher --net=host --device=/dev/video0:/dev/video0:rw --env RMW_IMPLEMENTATION ros2_qos_reliability:latest ros2 run v4l2_camera v4l2_camera_node --ros-args --log-level debug

Run ROS2 camera subscriber in another.

podman run --rm -it --name ros2_subscriber --net=host --env RMW_IMPLEMENTATION ros2_qos_reliability:latest ros2 topic echo --qos-reliability best_effort /image_raw

Confirm publisher and subscriber reliability settings

podman exec -it --env RMW_IMPLEMENTATION ros2_publisher /ros_entrypoint.sh ros2 topic info --verbose /image_raw

Delete tc qdisc on the loopback when finished.

sudo tc qdisc del dev lo root

Comments

The “reliable” subscription behaves as expected, the publisher throttles its publishing rate so until the point where all subscribers can receive all published messages given the bandwidth. This can be seen by observing the publisher’s debug messages.

However, this is also the case for the “best_effort” subscription.

@smac @Jaime_Martin_Losa

Off-topic, so collapsed by default

While I’m the first to be critical of things – just see my responses to similar issues discussed here in the past – a less bombastic post might have gotten the point across equally well.

A reproducible example and a succinct description of observation(s) should’ve sufficed.

Reason I post this is the fact how we interact with each other in an OSS community greatly influences our (and other’s) motivation, willingness to help and overall experience. If you’re looking for help, delivery of the message is as important as its content.

Note: I’m not dismissing your experience, or work. I just wanted to draw attention to some “non-functional” aspects here.

2 Likes

@ciandonovan

I have a couple of comments, i might be mistaken…

there is seemingly no functional difference between the two policies - both act as a reliable subscriber

i think this is the situation that you see with your environment? i am not even sure what this means…

using tools such as RViz over WiFi can cause a robot to stall

this is the problem that you want to solve?

Does it make sense if we have following feature to address this issue?

Rate Control Subscription

Rate Control Subscription can provide rate control on subscription side, even if many publishers are sending with high frequency rate.

The expected use case is that publishers are come from 3rd party application, then we cannot control the rate from subscriber.

This feature controls that subscriber application take the execution time for itself to keep the rate to take the message or issue the application callback based on the user specified rate on that subscription.

I introduced this feature at ROSCon JP 2022 Sony Keynote page 17, only this feature name.

more details for DDS are following.

Time Based Filter

This policy allows a DataReader to indicate that it does not necessarily want to see all values of each instance published under the Topic.

Rather, it wants to see at most one change every minimum_separation period.

Description

This is one of the QosPolicy TIME_BASED_FILTER that user application can apply it on DataReader.

Filter that allows a DataReader to specify that it is interested only in (potentially) a subset of the values of the data.

The filter states that the DataReader does not want to receive more than one value each minimum_separation, regardless of how fast the changes occur.

It is inconsistent for a DataReader to have a minimum_separation longer than its DEADLINE period.

By default minimum_separation=0 indicating DataReader is potentially interested in all values.

This QoS policy allows you to optimize resource usage (CPU and possibly network bandwidth) by only delivering the required amount of data to different DataReaders.

Details

  • minimum_separation parameter can be changeable during runtime.

  • This setting allows a DataReader to further decouple itself from the DataWriter objects. This means that even multiple DataWriter exists minimum_separation is decoupled from them.

Other Related Note

  • It is inconsistent for a DataReader to have a DEADLINE period less than its TIME_BASED_FILTER’s minimum_separation.

  • This does not make sense, so disallow this to be set is fine. So that QoS inconsistency event will be notified for user application.

  • RTI Connext DDS supports this feature.

  • DDS v1.4 specification does not explicitly describe either DataWriter or DataReader is responsible this filtering process. That means this leaves room for implementation specification.

2 Likes

Hi Tomoya,

The problem is that currently it seems that a ROS2 publisher will only publish at the rate at which its lowest common denominator subscriber can receive - regardless the reliability policy (“reliable” or “best_effort”).

The ROS2 QoS documentation defines the policy as

Best effort: attempt to deliver samples, but may lose them if the network is not robust.

and also implies that a publisher can be configured to act as a connectionless UDP type, where there is no rate control imposed by consumers. However I could be misunderstanding the role and function of the “best_effort” policy.

One of the exciting promises of ROS2 is the ability to construct real-time systems. The use of shared-memory for zero-copy message passing by the DDS middleware should allow constant-time ingestion of sensor data from separate nodes.

But currently, if a user were to fire up RViz and subscribe to a topic with which the bandwidth of the link between them can’t suffice, then that deterministic contract on the robot is broken, the publisher will throttle the publishing of the sensor data to satisfy the remote subscriber, despite that not being an internal limitation, and the behaviour of the robot will be altered.

I’ve reproduced this behaviour on ROS2 Foxy, Galactic, Humble, and the current Rolling, across both FastDDS and CycloneDDS, with both shared memory and over IP.

Maybe Rate Control Subscription could alleviate this problem somewhat. Say that a publisher was publishing at a rate of 30 Hz, a subscriber requested 10 Hz, what would happen in the case where the bandwidth was only sufficient for 5 Hz? Would the publisher throttle, or would the subscriber just drop messages?

While it’s not OotB behaviour, your post reminded me of RMW Fast DDS Publication Mode: Sync vs Async and how to change it.

As @Jaime_Martin_Losa mentions in that thread (async publishing is the default in many (all?) RMWs (or synchronous, it gets confusing)):

Yes I came across that before and tried both a synchronous and asynchronous transport, but unfortunately it didn’t make a difference.

I mainly use CycloneDDS so I’m less familiar with the FastDDS XML configuration, it’s possible I didn’t configure it correctly.

My suspicion though is that the issue is arising in the ROS2 client libraries (rclc, rclcpp, rclpy), and is blocking for message delivery, or doesn’t have separate queues for different subscribers?

  • Reliability QoS Policy: this policy specifies the delivery of the samples. If set to BEST EFFORT, the sample is sent only once and no confirmation of reception is expected whereas if this QoS is set to RELIABLE, a confirmation for all samples is expected. This latter configuration may be troublesome when sending large samples that need fragmentation because the sample’s lost fragments would have to be delivered again (with the associated network load). Without receiving the acknowledgement, the RELIABLE History is not allowed to dispose of samples.

This description of the BEST EFFORT policy sounds ideal, but not what I’m currently able to observe in my tests.

some messages will be dropped. this is simple, we just dont have enough resource to handle all messages.

1 Like

Or the publisher could throttle sending until the subscriber can keep up, as is happening now. Obviously data throughput will decrease, but messages won’t be dropped, the problem is that there’s seemingly only one queue for all subscribers, and so they’ll all suffer.

I do not understand how we can send the message 30Hz over 5Hz bandwidth? i may be mistaken?

In case 2, camera frames are dropped, but no ROS2 “messages” are dropped, because the publisher has throttled until it’s only generating messages at a frequency that all can be delivered across the slowest subscriber link. However, this is also impacting the nav2 node despite the fact it’s able to use zero-copy shared memory with the v4l2 node.

1 Like

@ciandonovan thank you so much for taking time to illustrate this. that makes it easier to understand for me.

and we expect that communication between v4l2 node and nav2 node should be 30Hz even with RViz is connected in case 2, right? i think this is the problem.

Do you happen to have related issue on ROS 2 · GitHub? i think it would be better to open the new issue ticket to track this.

2 Likes

No problem! Was also a nice excuse to use the pen on my old Thinkpad x230t :slight_smile:

I had planned to make an issue somewhere on Github, but wanted to get a broader opinion here first, as I didn’t know whether to file it under one of the ROS Client Libraries, ROS DDS layers, or the DDS repos themselves.

I can submit an issue at the general ROS 2 · GitHub if that works?

1 Like

and yes that’s exactly right, in an ideal scenario a “best_effort” subscriber would not be able to have any detrimental impact on a “reliable” one.

Tracking this issue on GitHub here QoS Best Effort Reliability Issue · Issue #1434 · ros2/ros2 · GitHub

Slightly off topic. If you have periodic data being published as from v4L2, then as long as you can tolerate an occasional missed update, I would argue that the publication(from v4L2) and the subscription (in nav2 Node) should both be best-efforts. Reliable data incur a large overhead in DDS compared to best-efforts data and should really only be used for asynchronous data e.g. for commands and state published on change.

Unfortunately, a lot of the data in ROS2 is by default configured as reliable (probably due to historical compatibility with ros1) and is hugely inefficient. There are very few cases of periodic data that need to be reliable. All of our sensor data is best-effort. If an update gets dropped, that’s ok since I know another one is on its way. Anyway, this is a pet peeve of mine with regards to unnecessarily reliable data in ros2.

1 Like

Agree, better to process a more recent camera frame than to block for the retransmission of an old one.

Have you managed to get the best-effort QoS policy working? I’ve mostly tested reliable publishers with best-effort subscribers, which should work but doesn’t, but maybe you might have had more luck with best-effort publishers and subscribers?

Yes, we have working best-efforts data. We typically have reliable publisher to reliable subscriber or best-efforts publisher to best-efforts subscriber. It’s not common we have a reliable publisher to best-efforts subscriber.

Are any of those nodes open-source? I’d like to try reproduce that, I tried before editing the v4l2 camera node to publish best-effort but still ran into the same issue.

Would still be beneficial to get reliable → best-effort working correctly as to not have to reconfigure or recompile many of the existing nodes. Also, in the rare cases where you want both reliable and best-effort subscribers, the publisher must still be reliable. Wouldn’t want a bad WiFi connection to a ROS debugging console to stall an internal state command.

If you’re publishing high-bandwidth data best-effort to a best-effort subscriber, and you throttle the link capacity between, does the publishing rate drop at all, or does the subscriber just loose data?

The latter should be the case, but I’ve yet to be successful reproducing it myself.