Hi all,
I'm just curious about this: I can see some vendors working on DDS implementations, but is there any DDS implementation that supports shared memory, or is working on it?
Thanks in advance,
Tomoya
A question like this is better for answers.ros.org, but I’ll give you a quick answer here.
Thanks, I will look into them.
Confirmed that RTI Connext DDS uses shared memory; it actually maps the shm segments into the process address space. But so far our internal performance test tells us the latency is not so good. The Connext DDS implementation is provided as a binary, so we cannot tell what's going on. Is there a specific place where we should have this conversation, or should we just ask RTI for help?
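For what it's worth, one rough way we sanity-checked the mapping on Linux is to scan /proc/self/maps for shared-memory-backed regions. A minimal sketch; the "/dev/shm" and "SYSV" filters are heuristics, and Connext's segment names may differ:

```cpp
// Minimal sketch: list shared-memory-backed mappings of the current process.
// Linux-specific; the substring filters are heuristics, not Connext-specific.
#include <fstream>
#include <iostream>
#include <string>

int main() {
  std::ifstream maps("/proc/self/maps");
  std::string line;
  while (std::getline(maps, line)) {
    // POSIX shared memory shows up under /dev/shm; SysV segments as /SYSV...
    if (line.find("/dev/shm") != std::string::npos ||
        line.find("SYSV") != std::string::npos) {
      std::cout << line << '\n';
    }
  }
  return 0;
}
```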
What sort of latency are you seeing? Could data marshalling and unmarshalling account for it?
Sorry to be late getting back to this.
Publisher:Subscriber = 1:1, Skylake, Ubuntu 16.04
| msg size [KB] | Latency [msec] |
| --- | --- |
| 4 | 0.1972224281 |
| 64 | 1.5988755584 |
| 256 | 6.1639215946 |
| 2048 | 59.9750656127 |
| 8192 | 201.675012207 |
I was expecting this to be much faster since it uses shared memory.
(*) Latency = (end - start), where start is taken right before the message is published and end is when the subscriber callback fires, so this measures communication latency only.
We are considering using https://github.com/ApexAI/performance_test instead of our own tool; we will check how it works.
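For reference, a minimal sketch of how that measurement can be taken with rclcpp. Node and topic names are placeholders, and a real benchmark would pre-allocate messages and average over many samples; this single-process version still crosses the RMW layer, since intra-process communication is off by default:

```cpp
// Minimal sketch of the measurement described above: take "start" right
// before publish(), take "end" inside the subscriber callback.
#include <chrono>
#include <cstdio>
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/byte_multi_array.hpp"

using Clock = std::chrono::steady_clock;

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("latency_probe");

  Clock::time_point start;
  auto sub = node->create_subscription<std_msgs::msg::ByteMultiArray>(
    "latency_topic", 10,
    [&start](std_msgs::msg::ByteMultiArray::SharedPtr) {
      const std::chrono::duration<double, std::milli> latency =
        Clock::now() - start;
      std::printf("latency: %.4f ms\n", latency.count());
    });

  auto pub =
    node->create_publisher<std_msgs::msg::ByteMultiArray>("latency_topic", 10);

  std_msgs::msg::ByteMultiArray msg;
  msg.data.resize(4 * 1024);  // 4 KB payload, as in the first row above

  rclcpp::WallRate rate(10.0);  // publish at 10 Hz
  while (rclcpp::ok()) {
    start = Clock::now();     // "start": right before publishing
    pub->publish(msg);
    rclcpp::spin_some(node);  // drives the callback, which records "end"
    rate.sleep();
  }
  rclcpp::shutdown();
  return 0;
}
```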
@tomoyafujita I'd recommend checking out the run_experiment.py script if you want to run a comprehensive batch of experiments with performance_test.
Thanks for the tip; will check that out.
We did try out performance_test; the results come up as follows.
Skylake, Ubuntu 16.04, Pub:Sub = 1:1, QoS (BestEffort, Volatile), latency mean [ms]
| Test Case | fastrtps | connext |
| --- | --- | --- |
| Array1k | 0.621794 | 0.501876 |
| Array4k | 0.619314 | 0.612616 |
| Array16k | 0.626285 | 0.997663 |
| Array32k | 0.65806 | 0.888612 |
| Array60k | 0.707112 | 5221.432597 |
| Array1m | 0.69115 | 29057.71732 |
| Array2m | 3.332739 | 0 (valid_num: 0) |
Not sure what's going on with Connext above 32 KB, and there is no way to investigate since it's a binary release. If the shared memory were being used properly, this should be much, much faster in both latency and throughput. Is there a specific configuration required for Connext? Maybe we are not using the right configuration? Could someone give us some help here?
Thanks,
@tomoyafujita DDS configuration is a rather large topic, and it would be difficult to even introduce it with all the appropriate context in a single post.
You're right that using shared memory is going to improve ECU- or PC-local latency performance. There are other considerations to make with respect to message size, such as packet fragmentation and how it affects the network layers, and network buffer sizes. These are more applicable when you're not using shared-memory features, which may help explain some of the results you posted for larger message sizes.
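As a quick aside, on Linux one way to inspect the kernel's socket-buffer ceilings, which cap how much unread UDP data a socket may hold and so matter for the non-shared-memory path with large messages, is a sketch like this (the /proc paths are Linux-specific):

```cpp
// Minimal sketch: print the kernel's max socket receive/send buffer sizes,
// which bound how much in-flight UDP data DDS can buffer on the network path.
// Irrelevant to the shared-memory path.
#include <fstream>
#include <iostream>
#include <string>

int main() {
  for (const char * path : {"/proc/sys/net/core/rmem_max",
                            "/proc/sys/net/core/wmem_max"}) {
    std::ifstream file(path);
    std::string value;
    std::getline(file, value);
    std::cout << path << " = " << value << " bytes\n";
  }
  return 0;
}
```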
The Array60k numbers may not be too far off from expected, due to the packet fragmentation and serialization/deserialization delays, which are especially troublesome with nested data structures. I cannot confidently say the same about the Array1m results; those seem off to me from a cursory review.
Were you able to rerun the performance test experiments with shared memory enabled?
I appreciate your help and comments on this thread. I understand what you mean: handling the protocol layer and the network is where the difficulty lies in tuning, compared to a shared-memory transport.
I am not sure whether we can actually use shared memory or not. As far as I can tell, the shared memory is mapped into the process space, but transmission still takes a really long time.
Continental published eCAL this week, a publish-subscribe framework that is designed for high data throughput with minimal latency.
It's using shared memory for inter-process communication and can also use different serialization protocols such as Google Protobuf, FlatBuffers, or Cap'n Proto for highest performance.
Check https://github.com/continental/ecal
Hi Tomoya, ADLINK Technology's OpenSplice DDS has shared-memory support for Dashing, giving a significant reduction in latency for ROS 2 modules running on the same compute node, with especially large improvements in 1:many scenarios. Download here and use ros2/rmw_opensplice. Would be happy to help.
Sorry to jump in here so late… I think there are some known inefficiencies in the way the ROS rmw layer uses RTI Connext DDS, in that it makes some extra memory copies and allocations.
RTI’s public benchmarks produce much better performance over shared memory. You can see the results here: https://www.rti.com/products/benchmarks.
| Message Size | Latency [msec] |
| --- | --- |
| 256 B | 0.040 |
| 1 KB | 0.054 |
| 2 KB | 0.085 |
| 4 KB | 0.123 |
| 16 KB | 0.197 |
| 32 KB | 0.343 |
| 64 KB | 0.615 |
You can reproduce this on your own platform using the open-source RTI perftest tool: https://github.com/rticommunity/rtiperftest
You can also download the binary distribution here: https://community.rti.com/downloads/rti-connext-dds-performance-test
To send data larger than 64 KB you need to enable asynchronous publishing. RTI perftest will do that automatically for you; you can look at the perftest source code to see how it's done.
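For anyone who wants a starting point in their own code, here is a rough sketch of what that looks like with the traditional Connext C++ API. It is abbreviated and not tested against any particular Connext version, so treat the perftest sources as the authoritative reference:

```cpp
// Hedged sketch (traditional RTI Connext C++ API): enable asynchronous
// publishing on a DataWriter's QoS so samples larger than 64 KB can be
// fragmented and sent by a separate publishing thread.
#include "ndds/ndds_cpp.h"

bool make_async_writer_qos(DDSDomainParticipant * participant,
                           DDS_DataWriterQos & writer_qos)
{
  if (participant->get_default_datawriter_qos(writer_qos) != DDS_RETCODE_OK) {
    return false;
  }
  // Switch from the default synchronous mode to asynchronous publishing.
  writer_qos.publish_mode.kind = DDS_ASYNCHRONOUS_PUBLISH_MODE_QOS;
  // The default flow controller sends fragments as fast as possible;
  // a custom flow controller could shape the traffic instead.
  writer_qos.publish_mode.flow_controller_name =
    DDS_String_dup(DDS_DEFAULT_FLOW_CONTROLLER_NAME);
  // Pass writer_qos to create_datawriter(...) when creating the writer.
  return true;
}
```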
Thanks for the information. Is the OpenSplice DDS implementation provided as a binary? It has to be the commercial version once it comes to an actual product, am I right? (I have not read the license yet, though.) Besides, do you have any benchmark results for comparison?
thanks,
Hi Tomoya, OpenSplice comes in open-source and commercial editions. Both work with Dashing and use the same rmw. Shared memory is a feature of the commercial edition, which is distributed as a binary; the other is source.
Thanks. Any benchmark results, then?
We have some brief benchmark results for your reference.
OpenSplice community vs. OpenSplice commercial
Latency test:
OpenSplice community edition + ROS 2 rclcpp + rmw_opensplice_cpp:
1k array: ~273 µs
4k array: ~299 µs
OpenSplice commercial edition + ROS 2 rclcpp + rmw_opensplice_cpp + shared memory:
1k array: ~105 µs
4k array: ~129 µs
Throughput test:
Throughput can reach up to 1.7 Gbps.
For example, an 8K array ROS message can be published/subscribed at rates up to ~25,200 Hz.
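(As a sanity check, assuming the 8K array is 8192 bytes: 8192 B × 25,200 Hz × 8 bit/B ≈ 1.65 Gbit/s, which is consistent with the ~1.7 Gbps figure above.)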
Notice…
Thanks for the information, that's really interesting!
BTW, may I just ask a few questions based on the results?
Thanks in advance, cheers
What was it tuned for?