What should be the correct behavior of a "reliable" transmission? (throughput problem)

Hello, everyone.

We’ve been testing ROS 2 and DDS lately and came across the following issue.
First of all, I’m using ROS 2 Ardent with OpenSplice DDS, but I believe the behavior also applies to newer versions and other DDS vendors. Please correct me if I’m wrong.
Test source: https://github.com/EwingKang/adlink_ros2_qos_test
Related article: https://index.ros.org/doc/ros2/About-Quality-of-Service-Settings/

TL;DR: We cannot guarantee the arrival of every published sample even with the “reliable” QoS.

Full version:
When I run the throughput testing tools (link above), I find that the publishing rate is significantly higher than the subscribing rate. In fact, I can publish at a rate of over 85,000 Hz with a 4 KB payload (that is, 332 MB/s), while the subscribing node only receives an average of about 1,000 Hz, and that rate is very unstable. Originally I thought it was a problem with the QoS settings, but it turns out the “reliable” QoS is already the default.
After a weeks-long investigation, we finally found the root cause. To our understanding, because the default history QoS is set to KEEP_LAST, the DDS implementation discards data that is already in the buffer but not yet taken by the rmw layer, replacing it with the newest (“last”) data sample.

However, for a throughput measurement I want every bit of my data to arrive at the destination.
To achieve that, I modified the node to use history QoS = KEEP_ALL, and additionally modified rmw_publisher.cpp (line 153) and rmw_subscription.cpp (line 155) with

// necessary so the memory won't burst within seconds
datawriter_qos.resource_limits.max_samples = 100;
datareader_qos.resource_limits.max_samples = 100;

respectively (see later notes).
So when the reader’s buffer is full, the writer is blocked by the reader, and thus every sample arrives at the subscribing side. For the details of this writer-blocking behavior, I’ve been studying the Vortex OpenSplice C++ Reference Guide (link to all OSPL docs), p. 347.
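For reference, the node-side half of the change looks roughly like this. It is only a minimal sketch against the Ardent-era rclcpp API (which still takes an rmw_qos_profile_t directly); the node name, topic name, and message type are placeholders, not the actual test code linked above.

#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/byte_multi_array.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);

  // Start from the default profile and switch the history policy to
  // KEEP_ALL so buffered samples are never silently replaced.
  rmw_qos_profile_t qos = rmw_qos_profile_default;
  qos.history = RMW_QOS_POLICY_HISTORY_KEEP_ALL;
  qos.reliability = RMW_QOS_POLICY_RELIABILITY_RELIABLE;

  auto node = rclcpp::Node::make_shared("throughput_pub");
  auto pub = node->create_publisher<std_msgs::msg::ByteMultiArray>("chatter", qos);

  // ... publish loop omitted ...

  rclcpp::shutdown();
  return 0;
}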
With this setting, I can get a synchronized publishing/subscribing rate of >15,500 Hz, and that’s more than 60 MB/s!

So my question is: how should I view this result? Yes, the DDS did “reliably” send the data over to the reading entity (at least that’s what is claimed). It’s just that we can’t poll the data reader fast enough, so the KEEP_LAST setting wipes out the existing data.
However, from the user’s standpoint this might seem a little weird: “reliable” QoS doesn’t mean 100% data reception. Should we somehow change the ROS 2 definition and behavior? Personally, I would prefer defaulting to KEEP_ALL with a default DDS buffer size of, say, 100, so that when the internal buffer is full, the write call blocks to make sure no data is dropped. From the perspective of ROS 2, that is what I would define as “reliable”.

Appreciate any opinion, and please correct me if I’m wrong at any perspective.
Thanks.

EDIT: The QoS addition above is a bit brute-force; modifying qos.cpp would be the proper way to do it.

Setting the RELIABILITY QoS to Reliable only means that data is resent when the receiver fails to receive it. How queueing is handled when the receiver is slower than the sender is governed entirely by the history settings, as you’ve pointed out.

In ROS 1 the behavior is always like KEEP_LAST with a depth (in ROS 1 it’s called queue_size, which is a required argument when creating subscribers and publishers); for an example, see what happens in ROS 1 when the outgoing buffer is full.
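Roughly, as a sketch (the topic name and message type are just placeholders, not the linked roscpp internals):

#include <ros/ros.h>
#include <std_msgs/String.h>

int main(int argc, char ** argv)
{
  ros::init(argc, argv, "talker");
  ros::NodeHandle nh;

  // The second argument is the required queue_size: once the outgoing
  // queue holds 10 messages, the oldest one is dropped to make room for
  // the newest -- effectively KEEP_LAST with a depth of 10.
  ros::Publisher pub = nh.advertise<std_msgs::String>("chatter", 10);

  // ... publish loop omitted ...
  return 0;
}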

Most people just make larger queues when this is an issue.

I think it’s for this reason that we decided to use KEEP_LAST and a depth (I believe the default is 10 or 100) as our default, in order to emulate the ROS 1 default behavior as closely as possible.

KEEP_ALL also has limits, but unlike KEEP_LAST’s depth, which always applies to each “instance” of samples, KEEP_ALL is bounded by two resource limits which always apply: max_samples and max_samples_per_instance. You could just set max_samples_per_instance to the same value as depth, and then ideally KEEP_ALL and KEEP_LAST would only differ in the strategy used when the queue is full, but you also need to set max_samples to a reasonable value as well. When using KEEP_ALL you probably also want to set max_instances to a reasonable value, and so using KEEP_ALL is just more complicated.
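To make that concrete, here is a sketch of those limits at the DDS layer, in the same style as the rmw snippet above (the values 100, 10, and 1 are arbitrary placeholders, and the surrounding writer-creation code is omitted):

// Sketch: KEEP_ALL history bounded by explicit resource limits.
datawriter_qos.history.kind = DDS::KEEP_ALL_HISTORY_QOS;

// Total samples the writer may buffer across all instances.
datawriter_qos.resource_limits.max_samples = 100;
// Samples buffered per instance; comparable to KEEP_LAST's depth.
datawriter_qos.resource_limits.max_samples_per_instance = 10;
// Number of distinct instances (key values) to track.
datawriter_qos.resource_limits.max_instances = 1;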

I think ultimately the user needs to understand the difference in KEEP_LAST and KEEP_ALL, and I think that departing from the terminology used by DDS (to your suggestion “Should we somehow change the ROS2 definition and behavior?”) is probably just going to be confusing. I also think that the default of KEEP_LAST is what ROS 1 users will be expecting by default, and I think that it’s “safer” in the sense that the publisher is less likely to be negatively impacted by a misbehaving subscriber (e.g. a visualization tool over a lossy link) and therefore KEEP_ALL is more situational (like for use in services). For those reasons I think KEEP_LAST makes sense as the default, but it’s definitely a subjective and opinionated position for the ROS 2 API to take.

@wjwwood, I really appreciate your detailed explanation of what’s going on.
It makes a lot of sense when you compare the behavior to ROS 1, where the publisher does not care whether the receiver gets the data or not.

I fully agree that we should make this clear for the user. The tutorial page About Quality of Service Settings did catch me off guard for a while, since it states “Reliable: guarantee that samples are delivered, may retry multiple times,” and that “delivered” does not mean end-to-end delivery between the ROS publisher and subscriber, but actually refers to the operation of the underlying DDS. That would be an easy fix to the wiki in that regard.

Indeed, with this KEEP_LAST style, no publisher will be affected by a broken subscriber, and that probably makes most scenarios easier (unless you’re measuring throughput :slight_smile:). However, I did find that the current system default history.length is set to 0 according to types.h. I couldn’t find the meaning of setting this value to 0.

As far as QoS configuration goes, I thought we only need to consider max_samples, since the ROS 2 rmw implementation is currently keyless (and thus has only one instance), AFAIK. The definition I found (quoted below) states that KEEP_ALL does not take the history depth into consideration. In my experiment, memory usage keeps growing very quickly until the computer dies, which matches the behavior stated in the DDS manual when max_samples in both durability_service and resource_limits is left at LENGTH_UNLIMITED. I propose exposing resource_limits.max_samples through the rmw layer, as well as providing a bounded value for any preset that involves KEEP_ALL history. This behavior should also be properly documented so the user knows what they’re doing when adjusting the configuration.
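Purely as a sketch of the idea (the max_samples field on the rmw QoS profile is hypothetical and does not exist today), the vendor-specific mapping, e.g. in qos.cpp, could look roughly like this:

// Hypothetical: forward a user-supplied sample limit to the DDS QoS
// whenever KEEP_ALL history is requested, so the buffer stays bounded
// and the writer blocks instead of exhausting memory.
if (qos_profile->history == RMW_QOS_POLICY_HISTORY_KEEP_ALL) {
  datawriter_qos.history.kind = DDS::KEEP_ALL_HISTORY_QOS;
  // qos_profile->max_samples is a proposed field, not an existing one;
  // it would map directly onto the DDS resource limit.
  datawriter_qos.resource_limits.max_samples =
    static_cast<int>(qos_profile->max_samples);
}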

Again, thanks for your reply. Any other opinions are most welcome.

P.S. The definition I found in the OSPL DDS documentation:

KEEP_ALL_HISTORY_QOS - all samples are stored, provided, the resources are available. On the publishing side, the Data Distribution Service will attempt to keep all samples (representing each value written) of each instance of data (identified by its key) managed by the DataWriter until they can be delivered to all subscribers. On the subscribing side, the Data Distribution Service will attempt to keep all samples of each instance of data (identified by its key) managed by the DataReader.
These samples are kept until the application “takes” them from the Data Distribution Service via the DataReader::take operation. The setting of depth has no effect. Its implied value is LENGTH_UNLIMITED. The resources that the Data Distribution Service can use to keep this history are limited by the settings of the ResourceLimitsQosPolicy.
If the limit is reached, the behaviour of the Data Distribution Service will depend on the ReliabilityQosPolicy. If the ReliabilityQosPolicy is BEST_EFFORT_RELIABILITY_QOS, the old values are discarded. If ReliabilityQosPolicy is RELIABLE_RELIABILITY_QOS, the Data Distribution Service will block the DataWriter until it can deliver the necessary old values to all subscribers.
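As a rough illustration of that last paragraph (a sketch against the classic OpenSplice C++ API; the writer and sample type are placeholders): with RELIABLE reliability and KEEP_ALL history, write() blocks while the limit is reached, up to reliability.max_blocking_time, and then reports a timeout rather than silently dropping the sample.

// Sketch: writer-side behavior when the resource limit is reached under
// RELIABLE + KEEP_ALL.
MyTopicType sample;  // placeholder user-defined topic type
DDS::ReturnCode_t status = writer->write(sample, DDS::HANDLE_NIL);
if (status == DDS::RETCODE_TIMEOUT) {
  // The readers did not take enough samples within max_blocking_time;
  // the sample was not written, so the application may retry or back off.
}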

Sorry for the long delay.

For now that is true, but we’re currently working on supporting IDL 4.2 directly in ROS 2, at which point you may specify a key and it will influence queuing as separate instances for each key value.

I agree, we need to expose more QoS to make KEEP_ALL more usable. Any help you (or others) could provide on this in the form of pull requests would be appreciated.