Is there any DDS implementation that uses shared memory within a system?

Hi all,

I'm just curious about this. I can see some vendors working on DDS implementations,
but is there any DDS implementation that supports shared memory, or one that is working on it?

thanks in advance,
Tomoya


A question like this is better for answers.ros.org, but I’ll give you a quick answer here.

  • The RTI Connext DDS implementation definitely has the ability to use shared memory for DDS clients running on the same computing node, and as I recall will do so automatically. However, it will still marshal data because it needs to do so for many of the features of Connext DDS, such as logging and introspection.
  • eProsima’s FastRTPS apparently does not use shared memory yet but it is on their roadmap.
  • OpenSplice DDS does use shared memory internally. I don’t know if this implementation marshals data for shared memory.

@gbiggs

Thanks, I will look into them.

We confirmed that RTI Connext DDS uses shared memory; it actually maps the shared memory segment into the process address space. However, so far our internal performance tests tell us the latency is not very good. The Connext DDS implementation is provided only as a binary, so we are not sure what is going on. Is there a specific place where we should have this conversation, or should we just ask RTI for help?

What sort of latency are you seeing? Could the data marshalling and unmarshalling account for it?


@gbiggs

Sorry to be late getting back here.

Publisher:Subscriber = 1:1, Skylake, Ubuntu 16.04

Msg size [KB]    Latency [msec]
    4              0.1972224281
   64              1.5988755584
  256              6.1639215946
 2048             59.9750656127
 8192            201.675012207

I was expecting this to be much faster since it uses shared memory.

(*) Latency = (end - start), where start is taken right before publishing the message and end is when the subscriber callback fires, so this measures only the communication latency.
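
For anyone curious what that measurement looks like in code, here is a minimal, hypothetical rclcpp sketch of the same idea (the real test used separate publisher and subscriber processes and larger array messages; the node, topic, and message type below are placeholders):

```cpp
// Minimal latency probe: stamp each message right before publish(),
// then compute (callback time - stamp) when the subscriber callback fires.
#include <chrono>
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/header.hpp"

using namespace std::chrono_literals;

class LatencyProbe : public rclcpp::Node
{
public:
  LatencyProbe()
  : Node("latency_probe")
  {
    pub_ = create_publisher<std_msgs::msg::Header>("latency_topic", 10);
    sub_ = create_subscription<std_msgs::msg::Header>(
      "latency_topic", 10,
      [this](std_msgs::msg::Header::SharedPtr msg) {
        // end - start: elapsed time between publish() and this callback firing
        const auto latency = now() - rclcpp::Time(msg->stamp);
        RCLCPP_INFO(get_logger(), "latency: %.3f ms", latency.seconds() * 1e3);
      });
    timer_ = create_wall_timer(
      100ms,
      [this]() {
        std_msgs::msg::Header msg;
        msg.stamp = now();  // start: taken right before publishing
        pub_->publish(msg);
      });
  }

private:
  rclcpp::Publisher<std_msgs::msg::Header>::SharedPtr pub_;
  rclcpp::Subscription<std_msgs::msg::Header>::SharedPtr sub_;
  rclcpp::TimerBase::SharedPtr timer_;
};

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  rclcpp::spin(std::make_shared<LatencyProbe>());
  rclcpp::shutdown();
  return 0;
}
```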

We are considering using https://github.com/ApexAI/performance_test instead of our own tool; we will check how it works.

@tomoyafujita I’d recommend checking out the run_experiment.py script if you want to run a comprehensive batch of experiments with performance_test.

@lyle

Thanks for the tip; we will check that out.


We tried out performance_test; the results are as follows.

Skylake, Ubuntu 16.04, Pub:Sub = 1:1, QoS (Best Effort, Volatile; see the rclcpp snippet after the table), latency mean [ms]

Test Case    fastrtps     connext
Array1k      0.621794        0.501876
Array4k      0.619314        0.612616
Array16k     0.626285        0.997663
Array32k     0.65806         0.888612
Array60k     0.707112     5221.432597
Array1m      0.69115     29057.71732
Array2m      3.332739        0 (valid_num: 0)
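
For reference, the QoS combination used for these runs (best effort reliability, volatile durability) is roughly what the following hypothetical rclcpp snippet expresses; the node name, topic, and message type are placeholders, not what performance_test uses internally:

```cpp
#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/u_int8_multi_array.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("qos_example");

  rclcpp::QoS qos(rclcpp::KeepLast(10));
  qos.best_effort();          // RELIABILITY = BEST_EFFORT
  qos.durability_volatile();  // DURABILITY  = VOLATILE

  auto pub = node->create_publisher<std_msgs::msg::UInt8MultiArray>("array_topic", qos);
  (void)pub;

  rclcpp::shutdown();
  return 0;
}
```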

We are not sure what is going on with Connext above 32 KB, and there is no way to investigate since it is a binary release. If the shared memory were used properly, this should be much, much faster, and throughput would improve as well. Is there a specific configuration required for Connext? Maybe we are not using the right configuration. Could someone give us some help here?

thanks,


@tomoyafujita DDS configuration is a rather large topic, and it would be difficult to even introduce it with all the appropriate context in a single post.

You’re right that using shared memory is going to improve ECU- or PC-local latency performance. There are other considerations with respect to message size, such as packet fragmentation and how it affects the network layers, as well as network buffer sizes. These are more applicable when you are not using shared memory features, which may help explain some of the results you posted for larger message sizes.

The Array60k numbers may not be too far off from what is expected, due to packet fragmentation and serialization/deserialization delays, which are especially troublesome with nested data structures. I cannot confidently say the same about the Array1m results; those seem off to me from a cursory review.

Were you able to rerun the performance test experiments with shared memory enabled?
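
In case it helps with that check: with Connext, one common way to make sure only the shared-memory transport is in play is an XML QoS profile along these lines; the library and profile names below are placeholders, and the file can be pointed to via the NDDS_QOS_PROFILES environment variable or named USER_QOS_PROFILES.xml in the working directory.

```xml
<dds>
  <qos_library name="ExampleLibrary">
    <qos_profile name="ShmemOnly" is_default_qos="true">
      <participant_qos>
        <transport_builtin>
          <!-- restrict the builtin transports to shared memory only -->
          <mask>SHMEM</mask>
        </transport_builtin>
      </participant_qos>
    </qos_profile>
  </qos_library>
</dds>
```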


@lyle

I appreciate your help and comments on this thread. I understand what you mean: taking care of the protocol layer and the network is what makes it difficult to tune things for the shared memory transport.

We are not sure whether we are actually able to use shared memory or not. As far as I can tell, the shared memory is mapped into the process space, but it still takes a really long time to transmit.

Continental published eCAL this week: a publish-subscribe framework designed for high data throughput with minimal latency.
It uses shared memory for inter-process communication and can also use different serialization protocols such as Google Protobuf, FlatBuffers, or Cap'n Proto for the highest performance.
Check out https://github.com/continental/ecal

Hi Tomoya, ADLINK Technology's OpenSplice DDS has shared memory support for Dashing, giving a significant reduction in latency for ROS 2 nodes running on the same compute node, with an especially large improvement when doing 1:many. Download here, use ros2/rmw_opensplice. Would be happy to help.


Sorry to jump in here so late… I think there are some known inefficiencies in the way the ROS rmw layer uses RTI Connext DDS, in that it makes some extra memory copies and allocations.

RTI’s public benchmarks produce much better performance over shared memory. You can see the results here: https://www.rti.com/products/benchmarks.

Message Size    Latency
256 B            0.040  msec
  1 KB           0.054  msec
  2 KB           0.085  msec
  4 KB           0.123  msec
 16 KB           0.197  msec
 32 KB           0.343  msec
 64 KB           0.615  msec

You can reproduce this on your own platform using the open source RTI perftest tool: https://github.com/rticommunity/rtiperftest
You can also download the binary distribution here: https://community.rti.com/downloads/rti-connext-dds-performance-test

To send data larger than 64 KB you need to enable asynchronous publishing. RTI perftest will do that automatically for you; you can look at the perftest source code to see how it is done.
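
For reference, outside of perftest this is typically done through a Connext XML QoS profile; a minimal sketch (the library and profile names are placeholders) looks roughly like this:

```xml
<dds>
  <qos_library name="ExampleLibrary">
    <qos_profile name="LargeData" is_default_qos="true">
      <datawriter_qos>
        <publish_mode>
          <!-- required to send samples larger than 64 KB -->
          <kind>ASYNCHRONOUS_PUBLISH_MODE_QOS</kind>
        </publish_mode>
      </datawriter_qos>
    </qos_profile>
  </qos_library>
</dds>
```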

@joespeed

Thanks for the information. Is the OpenSplice DDS implementation provided as a binary? It has to be the commercial version once it comes to an actual product, am I right? (I have not read the license yet, though.) Besides that, do you have any benchmark results for comparison?

thanks,


Hi Tomoya, OpenSplice comes in open source and commercial editions. Both work with Dashing and use the same rmw. Shared memory is a feature of the commercial edition, which is distributed as a binary; the other is distributed as source.


@joespeed

Thanks. Any benchmark results, then?

@tomoyafujita

We have some brief benchmark results for your reference:
OpenSplice community vs. OpenSplice commercial

Latency test:
OpenSplice community edition + ROS2 rclcpp + rmw_opensplice_cpp :
1k array: ~273 us
4k array: ~299 us

OpenSplice commercial edition + ROS2 rclcpp + rmw_opensplice_cpp + shared memory :
1k array: ~105us
4k array: ~129us

Throughput test:
Throughput can reach up to 1.7 Gbps.
For example, an 8 KB array ROS message can be published and subscribed at up to ~25200 Hz.

Notes:

  1. All latency and throughput tests use an optimized build.
  2. All message array memory is pre-allocated before being handed to the rmw layer (see the sketch below).
  3. All tests use a tuned OpenSplice configuration; a different configuration could give different performance.
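
As a rough illustration of point 2, here is a hypothetical rclcpp sketch (not the actual benchmark code) where the payload buffer is sized once, outside the publish loop, so no per-message allocation happens on the hot path; the message type, size, and rate are placeholders:

```cpp
#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/u_int8_multi_array.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("preallocated_publisher");
  auto pub = node->create_publisher<std_msgs::msg::UInt8MultiArray>("array_4k", 10);

  std_msgs::msg::UInt8MultiArray msg;
  msg.data.resize(4 * 1024);     // allocate the 4 KB payload once, up front

  rclcpp::WallRate rate(100.0);  // publish at 100 Hz
  while (rclcpp::ok()) {
    pub->publish(msg);           // reuses the pre-allocated buffer on every cycle
    rate.sleep();
  }

  rclcpp::shutdown();
  return 0;
}
```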

@cwyark

Thanks for the information, that's really interesting!

By the way, may I ask a few questions based on the results?

  • What is the platform device and ROS 2 version?
  • How many subscribers were there on the reader side, if you don't mind me asking?
  • Is there any chance of supporting much bigger data, such as images on the order of megabytes?

thanks in advance, cheers


What was it tuned for?
