
Is there any DDS implementation using shared memory inside the system?

Hi Tomoya, ADLINK Technology’s OpenSplice DDS has shared memory support for Dashing, giving a significant reduction in latency for ROS2 modules running on the same machine, with an especially large improvement when doing 1:many. Download here and use ros2/rmw_opensplice. Would be happy to help.


Sorry to jump in here so late… I think there are some known inefficiencies in the way ROS/rmw uses RTI Connext DDS, in that it makes some extra memory copies and allocations.

RTI’s public benchmarks show much better performance over shared memory. You can see the results here: https://www.rti.com/products/benchmarks.

|Message Size|Latency (msec)|
|-----------:|-------------:|
|       256 B|         0.040|
|        1 KB|         0.054|
|        2 KB|         0.085|
|        4 KB|         0.123|
|       16 KB|         0.197|
|       32 KB|         0.343|
|       64 KB|         0.615|

You can reproduce this on your own platform using the open source RTI perftest tool: https://github.com/rticommunity/rtiperftest
You can also download the binary distribution here: https://community.rti.com/downloads/rti-connext-dds-performance-test

To send data larger than 64 KB you need to enable asynchronous publishing. RTI perftest will do that automatically for you; you can look at the perftest source code to see how it is done.
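For reference, here is a minimal sketch (not taken from perftest) of what enabling asynchronous publishing can look like with the traditional Connext C++ API; the participant, publisher and topic setup is assumed to exist already, and the names are illustrative:

```cpp
#include <ndds/ndds_cpp.h>  // RTI Connext traditional C++ API

// Sketch: create a DataWriter with asynchronous publishing enabled so that
// samples larger than 64 KB can be fragmented. 'publisher' and 'topic' are
// assumed to come from the usual DomainParticipant setup.
DDSDataWriter * create_async_writer(DDSPublisher * publisher, DDSTopic * topic)
{
  DDS_DataWriterQos writer_qos;
  publisher->get_default_datawriter_qos(writer_qos);
  writer_qos.publish_mode.kind = DDS_ASYNCHRONOUS_PUBLISH_MODE_QOS;
  return publisher->create_datawriter(
    topic, writer_qos, NULL /* no listener */, DDS_STATUS_MASK_NONE);
}
```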

@joespeed

thanks for the information. Is the OpenSplice DDS implementation provided as a binary? It would have to be the commercial version once it comes to an actual product, am I right? (I have not read the license yet, though.) Besides, do you have any benchmark results for comparison?

thanks,


hi Tomoya, OpenSplice comes in open source and commercial editions. Both work with Dashing and use the same rmw. Shared memory is a feature of the commercial edition, which is distributed as a binary; the other is source.


@joespeed

thanks, any benchmark results then?

@tomoyafujita

We have some brief benchmark results for your reference:
OpenSplice community vs. OpenSplice commercial

Latency test:
OpenSplice community edition + ROS2 rclcpp + rmw_opensplice_cpp:
1k array: ~273 µs
4k array: ~299 µs

OpenSplice commercial edition + ROS2 rclcpp + rmw_opensplice_cpp + shared memory:
1k array: ~105 µs
4k array: ~129 µs

Throughput test:
Throughput can reach up to 1.7 Gbps.
For example: an 8K-array ROS message publishing/subscribing rate can reach ~25200 Hz.

Notice…

  1. All latency and throughput tests use an optimized build.
  2. All message array memory is pre-allocated before being handed to the rmw layer (see the sketch below).
  3. All tests use a tuned OpenSplice configuration; a different configuration could give different performance.
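For illustration only (this is not the benchmark code; the topic name, payload size and rate below are made up), pre-allocating the message array once outside the publish loop looks roughly like this with rclcpp on Dashing:

```cpp
#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/u_int8_multi_array.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = rclcpp::Node::make_shared("latency_pub");
  auto pub = node->create_publisher<std_msgs::msg::UInt8MultiArray>("chatter", 10);

  // Allocate the payload once up front so no per-sample allocation happens.
  std_msgs::msg::UInt8MultiArray msg;
  msg.data.resize(4 * 1024);  // e.g. a 4K array

  rclcpp::Rate rate(1000);
  while (rclcpp::ok()) {
    pub->publish(msg);  // the pre-sized message is handed to the rmw layer as-is
    rate.sleep();
  }

  rclcpp::shutdown();
  return 0;
}
```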

@cwyark

thanks for the information, that’s really interesting!

BTW, may I ask a few questions based on the result?

  • what is the platform device and ROS2 version?
  • How many subscribers are on the Reader side, if that’s not a problem to share?
  • Is there any chance to support much bigger data, such as MB-order images?

thanks in advance, cheers


What was it tuned for?

  • what is the platform device and ROS2 version?

we use Ubuntu 18.04 with ROS2 Dashing installed from the deb packages. The CPU is an Intel Core i5.

  • How many subscribers are on the Reader side, if that’s not a problem to share?

For our previous test, we simply used one publisher and one subscriber. We will run some tests with multiple nodes in the future.

  • Is there any chance to support much bigger data, such as MB-order images?

Definitely yes, it should support MB-order images, but we haven’t done that tuning yet :slight_smile:


It was tuned for minimum latency and maximum throughput. The fragment size and some memory management parameters in shared memory mode affect the performance.


@cwyark

thanks, that helps.

The eCAL project just published performance measurements for interprocess communication.

Two new sample applications are provided in the latest release to measure performance on other platforms; we would like to get some measurements from the community :slight_smile:
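As a very rough illustration of what such a measurement sender could look like with the eCAL C++ API (the unit name, topic, payload size and rate here are made up; the actual tools are the sample applications referenced in the ReadMe.md):

```cpp
#include <ecal/ecal.h>
#include <vector>

int main(int argc, char ** argv)
{
  // Initialize eCAL and create a raw binary publisher.
  eCAL::Initialize(argc, argv, "latency_snd");
  eCAL::CPublisher pub("latency_topic");

  // Fixed-size payload, allocated once; local delivery uses the shared memory layer.
  std::vector<char> payload(16 * 1024);

  while (eCAL::Ok()) {
    pub.Send(payload.data(), payload.size());
    eCAL::Process::SleepMS(1);
  }

  eCAL::Finalize();
  return 0;
}
```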

These are the platforms we used for the measurements. For details on how to measure, see the ReadMe.md.

-------------------------------
 Platform Windows 10 (AMD64)
-------------------------------
OS Name:                            Microsoft Windows 10 Enterprise
OS Version:                         10.0.16299
OS Manufacturer:                    Microsoft Corporation
OS Build Type:                      Multiprocessor Free
System Manufacturer:                HP
System Model:                       HP ZBook 15 G5
System Type:                        x64-based PC
Processor(s):                       1 Processor(s) Installed.
                                    [01]: Intel64 Family 6 Model 158 Stepping 10 GenuineIntel ~2592 MHz
Total Physical Memory:              32,579 MB

-------------------------------
 Platform Ubuntu 16 (AMD64)
-------------------------------
H/W path      Device    Class       Description
===============================================
                        system      HP ZBook 15 G3 (M9R63AV)
/0                      bus         80D5
/0/0                    memory      128KiB L1 Cache
/0/1                    memory      128KiB L1 Cache
/0/2                    memory      1MiB L2 Cache
/0/3                    memory      8MiB L3 Cache
/0/4                    processor   Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
/0/5                    memory      16GiB System Memory
/0/5/0                  memory      8GiB SODIMM Synchron 2133 MHz (0,5 ns)
/0/5/1                  memory      8GiB SODIMM Synchron 2133 MHz (0,5 ns)

Latencies in µs for a single pub/sub connection (20000 samples, zero drops). The shared memory transport layer is used by default for local communication.

|Payload Size (kB)|Win10 AMD64|Ubuntu16 AMD64|
|----------------:|----------:|-------------:|
|               1 |     25 µs |        14 µs |
|               2 |     25 µs |        14 µs |
|               4 |     26 µs |        15 µs |
|               8 |     28 µs |        16 µs |
|              16 |     33 µs |        18 µs |
|              32 |     37 µs |        22 µs |
|              64 |     47 µs |        26 µs |
|             128 |     68 µs |        40 µs |
|             256 |    107 µs |        66 µs |
|             512 |    190 µs |       134 µs |
|            1024 |    401 µs |       720 µs |
|            2048 |    937 µs |      1500 µs |
|            4096 |   1868 µs |      3600 µs |

@rex-schilasky

appreciate the information, that looks awesome.

I am curious and have some questions though:

  • these performance results are for eCAL’s primitive inter-process communication, right? Not including the ROS2 API.
  • is eCAL one of the DDS-RTPS implementations? That would mean we could communicate with other DDS implementations. (I went through the readme quickly; I believe it takes some ideas from DDS but is not exactly a DDS implementation.)
  • Do you have any plan to implement rmw_ecal in the future?

thanks,
Tomoya

@tomoyafujita

you are welcome …

  • The measured performance is indeed the primitive inter-process communication. eCAL has a layered design, in the sense that you can combine different kinds of transport layers with different serialization formats (like google protobuf or capnproto).
    That means you can use eCAL to just exchange raw payloads with a timestamp, or you can use the higher-level templated message pub/subs to put modern serialization protocols on top.

  • eCAL is not designed to be DDS-specification compatible. For the highest performance in the field of autonomous driving we decided not to be “wire compatible” and to support only a minimal set of QoS.
    Even though DDS is a nice standard, the different implementations like RTI or OpenSplice can only exchange data if you follow the DDSI-specified protocol strictly. Some vendors allow shared memory data exchange, but to my knowledge this is not part of the standard.
    Finally, the message schema evolution possibilities of modern protocols like google protobuf were a strong requirement for eCAL and are hard to realize with known DDS implementations.

  • An rmw_ecal plugin would be great for sure. But if interoperability with the DDS standard is a strong requirement, it makes no sense. On the other hand you can configure eCAL to use fastRTPS instead of its own transport layers, but that would make no sense at all, because ROS2 can use fastRTPS natively ;-).

Cheers, ReX

@rex-schilasky

thanks for sharing your thoughts! That helps a lot.
I do agree with you; it is mostly about the use cases.
But I believe that the more generic it is, the better for the community.

Actually, we have already tried connecting a shared-memory-based pub/sub library (a Sony internal lightweight library) to rmw_sony_cpp, which is NOT DDS-spec compatible. We did this to see how much faster we could make it compared against rmw_xxx on the ROS2 rclcpp APIs.

Now we are considering something with:

  • DDS specification compatible
  • Open Source / Free
  • Shared Memory Feature Available
  • RMW supported

We believe that would be good for everyone and for the ROS community.

thanks,
Tomoya

@tomoyafujita

the “Sony internal approach” sounds very familiar to me. I fully agree with your conclusions; eCAL could ultimately fulfill only 2 of the 4 requirements (open source, shared memory). Realizing an rmw_ecal wouldn’t be that complicated because eCAL’s API is very similar to ROS’ API and covers most of its features.

To interact with other IPC/DDS systems, eCAL can switch different transport modes (like LCM or fastRTPS) on or off for a complete host, a single process, or a single pub/sub connection. This feature is the compromise we implemented to bridge messages to/from other systems like ROS2. But I’m not sure whether we will support that approach in future releases or exclude it again and realize dedicated gateways instead.

fastRTPS is, from my point of view, an excellent choice for the ROS2 middleware layer today. It’s well designed, very lean, and shared memory is on their roadmap. But just to complete your requirement list, it would be a nice task to realize an rmw_ecal :).

There is no requirement for an RMW implementation to use DDS or be compatible with it on the wire. The interface is intentionally designed so that it can be satisfied by other middlewares (see the design article). One example of a non-DDS RMW implementation: rmw_dps.
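As a small illustration of that abstraction (just a sketch; the build and link setup against an rmw implementation is omitted), every rmw implementation, DDS-based or not, reports itself through the same C interface:

```cpp
#include <cstdio>
#include "rmw/rmw.h"

int main()
{
  // Prints the identifier of whichever rmw implementation is linked in,
  // e.g. a DDS-based one or a non-DDS one such as rmw_dps.
  std::printf("rmw implementation: %s\n", rmw_get_implementation_identifier());
  return 0;
}
```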


@dirk-thomas

Looks really interesting. I didn’t expect that there are so many existing DDS and non-DDS rmw implementations. I need to study the rmw design in more depth. Thank you for pointing me in the right direction.


Just for completeness: I quickly ran the roundtrip test bundled with OpenSplice using a federated (shared-memory) deployment on 2 machines (a Win10/i7-8550U@1.99 GHz laptop and an Ubuntu16/Xeon E3-1270 server from 2012) and got the following results (end-to-end latencies in µs):

|Payload Size (KB)|Win10 (µs)|Ubuntu16 (µs)|
|----------------:|---------:|------------:|
|                1|        12|           11|
|                2|        13|           11|
|                4|        13|           12|
|                8|        14|           13|
|               16|        15|           14|
|               32|        16|           17|
|               64|        19|           22|
|              128|        25|           34|
|              256|        38|           57|
|              512|        64|          103|
|             1024|       138|          193|
|             2048|       290|          417|
|             4096|       714|         1034|

This IMO shows the fundamental difference between a ‘shared-memory transport’ (as other DDS vendors exploit) and ‘shared-memory-based data sharing’ as OpenSplice applies in its federated deployment.

Note that this is ‘raw’ DDS performance, i.e. without any rmw layer, and it can easily be reproduced using the bundled ‘Roundtrip’ example, which reports min/mean roundtrip latency (so end-to-end latency is roundtrip/2) as well as write/read execution times. On these platforms the end-to-end latency works out to writeTime + readTime + 2 µs. I can share the ‘raw’ results too if that’s interesting.

PS: another rather fundamental difference between a ‘shared-memory transport’ and ‘shared-memory-based data sharing’ is that with the latter, regardless of the number of subscribed applications within that ‘federation’, there is always only one copy of the payload ‘in use’ (in shared memory). Related meta-data (w.r.t. an instance being read/not-read, new/not-new, alive/disposed) is of course managed per subscriber, but for large(r) payloads and/or many applications per node this has a significant impact on both scalability and determinism.


This is really impressive performance. So if you exchange payload between multiple matching pub/sub entities, you share the data through only ONE memory file, and this is called ‘data sharing’? Then eCAL is basically doing it the same way: publishers organize their payload in memory and inform ‘listening’ subscribers about updates.
For small payloads the speed depends only on the kind of IPC event mechanism. But because the subscriber copies the payload from shared memory into its own process memory (to give other readers access), there is unfortunately a second memcpy, and this leads to the higher latency for large payloads.
Can the ‘data sharing’ methodology of OpenSplice avoid this ‘second’ copy?