Hello, everyone.
We’ve been testing ROS2 and DDS lately and came across the following issue.
First of all, I’m using ROS2 ardent with OpenSplice DDS, but I believe the behavior also applies to newer versions and other DDS vendors. Please correct me if I’m wrong.
Test source: https://github.com/EwingKang/adlink_ros2_qos_test
Related article: https://index.ros.org/doc/ros2/About-Quality-of-Service-Settings/
TL;DR: We cannot guarantee the arrival of every message on a topic, even with “reliable” QoS.
Full version:
When running the throughput testing tools (link above), I discovered that the publishing rate is significantly higher than the subscribing rate. In fact, I can publish at over 85000 Hz with a 4KB payload (that is, 332MB/s), while the subscribing node only receives at an average of 1000 Hz, and that rate is very unstable. Originally I thought the problem was the QoS setting, but it turns out that “reliable” QoS is already the default.
After a weeks-long investigation, we finally found the root cause. To our understanding, because the default history QoS is KEEP_LAST, DDS discards data that is already in the buffer but has not yet been taken by the rmw layer, and replaces it with the newest (“LAST”) data sample.
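To make the KEEP_LAST behavior described above concrete, here is a minimal single-threaded toy model. It is only a sketch of the effect and does not use any DDS or rmw API: the names `RunStats`, `simulate_keep_last`, `depth` and `writes_per_take` are made up for illustration, and the assumption is that the writer adds a fixed number of samples between two reader takes.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Counters for one simulated run (hypothetical helper type, not a DDS type).
struct RunStats {
  std::size_t published = 0;    // samples handed to the writer
  std::size_t received = 0;     // samples the reader actually took
  std::size_t overwritten = 0;  // samples replaced before the reader took them
};

// Toy model of a KEEP_LAST reader cache holding at most `depth` samples.
// The writer adds `writes_per_take` samples between two reader takes,
// and the reader takes exactly one sample each time it runs.
RunStats simulate_keep_last(std::size_t depth,
                            std::size_t total_samples,
                            std::size_t writes_per_take)
{
  assert(depth > 0 && writes_per_take > 0);
  std::deque<std::size_t> cache;
  RunStats stats;
  for (std::size_t seq = 0; seq < total_samples; ++seq) {
    if (cache.size() == depth) {
      cache.pop_front();  // KEEP_LAST: the oldest untaken sample is discarded
      ++stats.overwritten;
    }
    cache.push_back(seq);
    ++stats.published;
    if ((seq + 1) % writes_per_take == 0) {
      cache.pop_front();  // the reader takes one sample
      ++stats.received;
    }
  }
  stats.received += cache.size();  // the reader drains what is left at the end
  return stats;
}
```

Calling `simulate_keep_last(10, 8500, 85)` mimics the 85000 Hz versus 1000 Hz ratio from my test, with an assumed history depth of 10. In this model the reader ends up with only a small fraction of the published samples, and every missing sample was overwritten in the cache rather than lost in transport.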
However, to measure throughput, I want every bit of my data to arrive at the destination.
To achieve that, I modified the node to use history QoS = KEEP_ALL, and added the following to rmw_publisher.cpp (line 153) and rmw_subscription.cpp (line 155):
// necessary so the memory won't burst within seconds
datawriter_qos.resource_limits.max_samples=100;
datareader_qos.resource_limits.max_samples=100;
respectively (see later notes).
So when the reader buffer is full, the writer is blocked by the reader, and thus every sample arrives at the subscribing side. For the details of this writer-blocking behavior, I have been referring to the Vortex OpenSplice C++ Reference Guide (link to all OSPL docs), p. 347.
With this setting, I can get a synchronized publishing/subscribing rate of >15,500 Hz, and that’s more than 60MB/s!!!
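The blocking behavior can be sketched with a similar toy model. Again, this is not OpenSplice or rmw code: `BlockingStats`, `simulate_keep_all` and the tick-based timing are assumptions made for illustration. The writer tries to write once per tick, the reader takes one sample every `ticks_per_take` ticks, and a full cache makes the writer wait instead of overwriting. In real DDS the wait is additionally bounded by the reliability QoS `max_blocking_time`, which this sketch ignores.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Counters for one simulated run (hypothetical helper type, not a DDS type).
struct BlockingStats {
  std::size_t published = 0;      // samples accepted into the cache
  std::size_t received = 0;       // samples taken by the reader
  std::size_t blocked_ticks = 0;  // ticks the writer spent waiting on a full cache
  bool in_order = true;           // true while samples arrive in publish order
};

// Toy model of KEEP_ALL with resource_limits.max_samples = `max_samples`.
// A full cache blocks the writer until the reader has taken a sample.
BlockingStats simulate_keep_all(std::size_t max_samples,
                                std::size_t total_samples,
                                std::size_t ticks_per_take)
{
  assert(max_samples > 0 && ticks_per_take > 0);
  std::deque<std::size_t> cache;
  BlockingStats stats;
  std::size_t next_seq = 0;
  for (std::size_t tick = 1; stats.received < total_samples; ++tick) {
    if (next_seq < total_samples) {
      if (cache.size() < max_samples) {
        cache.push_back(next_seq++);  // room available: the write succeeds
        ++stats.published;
      } else {
        ++stats.blocked_ticks;        // cache full: the writer waits this tick
      }
    }
    if (tick % ticks_per_take == 0 && !cache.empty()) {
      if (cache.front() != stats.received) {
        stats.in_order = false;
      }
      cache.pop_front();              // the reader takes the oldest sample
      ++stats.received;
    }
  }
  return stats;
}
```

With `simulate_keep_all(100, 8500, 85)` every published sample is received, at the price of the writer being throttled down to the reader's pace. That matches what I observed: the publishing and subscribing rates become synchronized.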
So my question is: how should I view this result? Yes, DDS did “reliably” send the data over to the reading entity (at least that’s what they claim). It’s just that we can’t poll the datareader fast enough, so the “KEEP_LAST” setting wipes out data that has not yet been taken.
However, from the user’s standpoint, this might seem a little weird: “reliable” QoS doesn’t mean 100% data reception. Should we somehow change the ROS2 definition and behavior? Personally, I would favor defaulting to KEEP_ALL with a default DDS buffer size of, say, 100, so that when the internal buffer is full, the write call blocks to make sure no data is dropped. That is what should be defined as “reliable” from the perspective of ROS2.
I’d appreciate any opinions, and please correct me if I’m wrong on any point.
Thanks.
EDIT: Adding the QoS settings this way is a little crude; modifying qos.cpp would be a much more proper way of doing it.