Is there any DDS implementation that uses shared memory within a single system?

  • What is the platform (device) and the ROS2 version?

We use Ubuntu 18.04 with ROS2 Dashing installed from the deb packages. The CPU is an Intel Core i5.

  • How many subscribers are on the reader side, if you don't mind sharing?

For our previous test we simply used one publisher and one subscriber. We will run tests with multiple nodes in the future.

  • Is there any chance of supporting much bigger data, such as MB-order images?

Definitely yes, it should support MB-order images, but we haven’t done that tuning yet :slight_smile:


It is tuned for minimum latency and maximum throughput. The fragment size and some memory management parameters in shared memory mode affect the performance.


@cwyark

thanks, that helps.

The eCAL project just published performance measures for interprocess communication.

Two new sample applications are provided in the latest release to measure performance on other platforms; we would like to get some measurements from the community :slight_smile:

These are the platforms we used for the measurements. For details on how to measure, see the ReadMe.md.

-------------------------------
 Platform Windows 10 (AMD64)
-------------------------------
OS Name:                            Microsoft Windows 10 Enterprise
OS Version:                         10.0.16299
OS Manufacturer:                    Microsoft Corporation
OS Build Type:                      Multiprocessor Free
System Manufacturer:                HP
System Model:                       HP ZBook 15 G5
System Type:                        x64-based PC
Processor(s):                       1 Processor(s) Installed.
                                    [01]: Intel64 Family 6 Model 158 Stepping 10 GenuineIntel ~2592 MHz
Total Physical Memory:              32.579 MB

-------------------------------
 Platform Ubuntu 16 (AMD64)
-------------------------------
H/W path      Device    Class       Description
===============================================
                        system      HP ZBook 15 G3 (M9R63AV)
/0                      bus         80D5
/0/0                    memory      128KiB L1 Cache
/0/1                    memory      128KiB L1 Cache
/0/2                    memory      1MiB L2 Cache
/0/3                    memory      8MiB L3 Cache
/0/4                    processor   Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
/0/5                    memory      16GiB System Memory
/0/5/0                  memory      8GiB SODIMM Synchron 2133 MHz (0,5 ns)
/0/5/1                  memory      8GiB SODIMM Synchron 2133 MHz (0,5 ns)

Latencies in µs for a single pub/sub connection (20000 samples, zero drops). The shared memory transport layer is used by default for local communication.

|Payload Size (kB)|Win10 AMD64|Ubuntu16 AMD64|
|----------------:|----------:|-------------:|
|               1 |     25 µs |        14 µs |
|               2 |     25 µs |        14 µs |
|               4 |     26 µs |        15 µs |
|               8 |     28 µs |        16 µs |
|              16 |     33 µs |        18 µs |
|              32 |     37 µs |        22 µs |
|              64 |     47 µs |        26 µs |
|             128 |     68 µs |        40 µs |
|             256 |    107 µs |        66 µs |
|             512 |    190 µs |       134 µs |
|            1024 |    401 µs |       720 µs |
|            2048 |    937 µs |      1500 µs |
|            4096 |   1868 µs |      3600 µs |

@rex-schilasky

Thanks for the information, that looks awesome.

I am curious and have some questions though:

  • This performance result is for eCAL's primitive inter-process communication, right? It does not include the ROS2 API.
  • Is eCAL one of the DDS-RTPS implementations? That would mean we could communicate with other DDS implementations. (I went through the README quickly, and I believe it takes some ideas from DDS but is not exactly a DDS implementation.)
  • Do you have any plans to implement rmw_ecal in the future?

thanks,
Tomoya

@tomoyafujita

you are welcome …

  • The measured performance is indeed the primitive inter-process communication. eCAL has a layered design in the sense that you can combine different kinds of transport layers with different serialization formats (like Google protobuf or Cap’n Proto).
    That means you can use eCAL either to just exchange raw payloads with a timestamp, or to use higher-level templated message pub/subs that put modern serialization protocols on top (see the sketch after this list).

  • eCAL is not designed to be compatible with the DDS specification. For the highest performance in the field of autonomous driving we decided not to be “wire compatible” and to support only a minimal set of QoS.
    Even though DDS is a nice standard, the different implementations like RTI or OpenSplice can only exchange data if you strictly use the protocol specified by DDSI. Some vendors allow shared memory data exchange, but to my knowledge this is not part of the standard.
    Finally, the message schema evolution possibilities of modern protocols like Google protobuf were a strong requirement for eCAL and are hard to realize with the known DDS implementations.

  • An rmw_ecal plugin would be great for sure. But if interoperability with the DDS standard is a strong requirement, it makes no sense. On the other hand, you can configure eCAL to use fastRTPS instead of its own transport layers, but that would make no sense at all, because ROS2 can use fastRTPS natively ;-).
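
To make the layered design above a bit more concrete, here is a rough sketch of the two publishing styles mentioned (raw binary payload vs. a typed message on top), loosely based on eCAL’s public samples; exact header paths and signatures may differ between eCAL releases:

```cpp
// Rough sketch of eCAL's two publishing styles (based on the public samples;
// API details may vary between eCAL releases).
#include <ecal/ecal.h>   // core API: Initialize/Finalize, raw CPublisher

#include <vector>

int main(int argc, char** argv)
{
  eCAL::Initialize(argc, argv, "shm demo");

  // Style 1: raw payload pub/sub -- eCAL just moves the bytes (plus a timestamp),
  // using the shared memory layer for local subscribers.
  eCAL::CPublisher raw_pub("raw_topic");
  std::vector<char> payload(64 * 1024, 0);
  raw_pub.Send(payload.data(), payload.size());

  // Style 2: typed pub/sub with a serialization protocol on top, e.g. including
  // <ecal/msg/protobuf/publisher.h> gives eCAL::protobuf::CPublisher<T> for any
  // generated protobuf message type T.

  eCAL::Finalize();
  return 0;
}
```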

Cheers, ReX

@rex-schilasky

Thanks for sharing your thoughts! That helps a lot.
I do agree with you, it is mostly about the use cases.
But I believe that the more generic it is, the better for the community.

Actually, we have already tried out a shared-memory-based pub/sub library (a Sony internal lightweight library) to connect rmw_sony_cpp, which is NOT DDS spec compatible. We did this to see how much faster we can make it compared against rmw_xxx through the ROS2 rclcpp APIs.

Now we are considering something with:

  • DDS specification compatible
  • Open Source / Free
  • Shared Memory Feature Available
  • RMW supported

We believe that is going to be good for everyone and for the ROS community.

thanks,
Tomoya

@tomoyafujita

The “Sony internal approach” sounds very familiar to me. I fully agree with your conclusions; eCAL could in the end realize only 2 of the 4 requirements (Open Source, Shared Memory). Realizing an rmw_ecal wouldn’t be that complicated, because eCAL’s API is very similar to ROS’ API and covers most of its features.

To interact with other IPC/DDS systems, eCAL can switch different transport modes (like LCM or fastRTPS) on/off for a complete host, a single process or a single pub/sub connection. This feature is the compromise that we implemented to bridge messages to/from other systems like ROS2. But I’m not sure whether we will keep that approach in future releases or remove it again and realize dedicated gateways instead.

fastRTPS is, from my point of view, an excellent choice for the ROS2 middleware layer today. It’s well designed, very lean, and shared memory is on their roadmap. But just to complete your requirement list, it would be a nice task to realize an rmw_ecal :).

There is no requirement for an RMW implementation to use DDS or be compatible with it on the wire. The interface is intentionally designed in a way that it can be satisfied by other middleware (see the design article). One example of a non-DDS RMW implementation: rmw_dps.


@dirk-thomas

Looks really interesting. I didn't expect that there are so many existing DDS and non-DDS rmw implementations. I need to study the rmw design more in depth. Thank you for pointing me in the right direction.


Just for completeness: I quickly performed the OpenSplice bundled roundtrip test using a federated (shared-memory) deployment on 2 machines (a Win10/i7-8550U@1.99 GHz laptop and an Ubuntu16/Xeon E3-1270 server from 2012) and got the following results (end-to-end latencies in µs):

|Payload Size (kB)|    Win10| Ubuntu16|
|----------------:|--------:|--------:|
|                1|    12 µs|    11 µs|
|                2|    13 µs|    11 µs|
|                4|    13 µs|    12 µs|
|                8|    14 µs|    13 µs|
|               16|    15 µs|    14 µs|
|               32|    16 µs|    17 µs|
|               64|    19 µs|    22 µs|
|              128|    25 µs|    34 µs|
|              256|    38 µs|    57 µs|
|              512|    64 µs|   103 µs|
|             1024|   138 µs|   193 µs|
|             2048|   290 µs|   417 µs|
|             4096|   714 µs|  1034 µs|

This, in my opinion, shows the fundamental difference between a ‘shared-memory transport’ (as other DDS vendors exploit) and ‘shared-memory-based data sharing’ as OpenSplice applies for its federated deployment.

Note that this is ‘raw’ DDS performance, i.e. without any rmw layer, and it can easily be reproduced using the bundled ‘Roundtrip’ example, which shows the min/mean roundtrip latency (so the end-to-end latency is roundtrip/2) as well as the write/read execution times. Those also show that on these platforms the end-to-end latency is writeTime + readTime + 2 µs. I can share the ‘raw’ results too if that’s of interest.

PS> Another rather fundamental difference between a ‘shared-memory transport’ and ‘shared-memory-based data sharing’ is that for the latter, regardless of the number of subscribed applications within that ‘federation’, there will always be only one copy of the payload ‘in use’ (in shared memory). The related metadata (w.r.t. an instance being read/not-read, new/not-new, alive/disposed) is of course managed for each subscriber, but for larger payloads and/or many applications ‘per node’ this has a significant impact on both scalability and determinism.


This is really impressive performance. So if you exchange a payload between multiple matching pub/sub entities, you share the data through only ONE memory file, and this is called ‘data sharing’? In the end eCAL is doing it the same way: publishers organize their payload in memory and inform ‘listening’ subscribers about updates.
For small payloads the speed only depends on the kind of IPC event mechanism. But because the subscriber copies the payload from shared memory into its process memory (to give other readers access), there is unfortunately a second memcopy, and this leads to the higher latency for large payloads.
Can the ‘data sharing’ methodology of OpenSplice avoid this ‘second’ copy?

There’s indeed a single memcopy ‘into’ the shared memory, which effectively populates the DDS reader cache of all subscribed ‘federated’ applications, followed by memcopies by all matching co-located/federated readers from shared memory into the local reader’s address space. We deemed it unsafe to have applications point directly into shared memory, so we didn’t (want to) avoid that ‘second copy’.

Another aspect comes in when there’s also remote interest in the (same) data, and that’s our ‘network scheduler’ concept: each federation has a single/singleton network scheduler (per interface) that handles all data coming into the federation (from the network) as well as going out of the federation (onto the network). The advantage here is that, as in DDS, each data sample has an associated urgency (latency-budget) as well as importance (transport-priority), and it’s this scheduler that can prioritize samples from any application in the federation (for which it exploits traffic-shaped priority lanes) as well as batch samples based on the latency budget (for which it exploits packing into large UDP frames).
So a low-priority application could send a high-priority alarm (sample) which would then pre-empt lower-priority data end-to-end (at the sending side, over the network using DiffServ, and on the receiving/delivery side) … a nice balance between efficiency and determinism, something that’s really valued in the domain we emerged from years ago (naval combat management systems).
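
For illustration only, this is roughly how the two QoS policies mentioned above (latency budget for urgency, transport priority for importance) are set on a DataWriter with the classic DCPS C++ API; a sketch, with the vendor-specific header and error handling omitted, and `pub`/`topic` assumed to exist:

```cpp
// Sketch: setting the urgency/importance QoS mentioned above on a DataWriter
// using the classic DCPS C++ API. The DCPS header is vendor-specific
// (e.g. OpenSplice ships ccpp_dds_dcps.h); error handling is omitted.
DDS::DataWriter_var make_prioritized_writer(DDS::Publisher_ptr pub, DDS::Topic_ptr topic)
{
  DDS::DataWriterQos wqos;
  pub->get_default_datawriter_qos(wqos);

  // Urgency: a latency budget of 10 ms lets the network scheduler batch samples.
  wqos.latency_budget.duration.sec     = 0;
  wqos.latency_budget.duration.nanosec = 10 * 1000 * 1000;

  // Importance: the transport priority feeds the traffic-shaped priority lanes
  // (and can be mapped to DiffServ/DSCP on the wire).
  wqos.transport_priority.value = 100;

  return pub->create_datawriter(topic, wqos, NULL, DDS::STATUS_MASK_NONE);
}
```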


Thank you for sharing all these insights, really nice. So many tweaks and years of experience, that's amazing.
Is there any shared-memory-based message exchange that is standardized by the OMG, or is this planned? I don't think so, but it would be nice to get that information first hand.

@hansvanthag
@rex-schilasky

This is fantastic, great work!
These are really good insights for me.

Note that this is ‘raw’ DDS-performance i.e. without any rmw-layer

We are really interested and would appreciate it if you could share measurements with the RMW layer.
We would do that with the OSS version, but it does not support shared memory, right?

‘shared-memory-transport’

Just the transport is done via shared memory,
which means the RTPS protocol runs over shared memory?

‘shared-memory-based datasharing’

No serialization and no RTPS protocol, right?

We deemed it unsafe to have application point directly into shared-memory
so we didn’t (want to) avoid that ‘second copy’.

This is reasonable; it is hard to tell who the actual owner is.
We might need lifetime / metadata (like an "okay to free?" flag) in that shared memory.

Is there any shared memory based message exchange that is standardized from the OMG or is this planned ?

We would also like to know if there is any plan.

thanks

  1. On the OSS version: indeed, that doesn’t support shared memory (federated deployment) at the moment.
  2. On ‘shared-memory transport’: that’s indeed RTPS ‘over shared memory’.
  3. On ‘shared-memory-based data sharing’: indeed, no RTPS protocol and/or serialization.
  4. On the ‘lifetime’ of data in shared memory: this is fully determined by DDS QoS (in combination with reference-counted payloads), i.e. the data/instance lifecycle w.r.t. (un)registering, disposing and lifespan.
  5. On ‘standardization’: as it is an implementation choice (that has a non-functional/performance impact, but no ‘functional’ impact) it is unlikely to get standardized (similar to the shared-memory transports that other DDS vendors have flavors of). Note also that in OpenSplice, switching between ‘standalone’ deployment and federated (shared-memory-based) deployment is done by a simple configuration boolean (‘singleprocess’ = true/false); there is no need to change/recompile/link your applications at all.

PS> One more (perhaps interesting) aspect of ‘federated deployment’ is that when TRANSIENT and/or PERSISTENT data is exploited (which I don’t think is the case yet for ROS2), it is typical to configure a ‘durability service’ as part of a/any federation. That service will maintain the non-volatile data for late joiners, so that whenever/wherever a late-joining application is started on such a federation, that data is already present, i.e. instantly available, and it won’t induce any latency and/or network traffic for that application to get ‘its’ historical data. (Two reasons: 1. the durability services align with each other, and perhaps more importantly 2. published data is likely going to be multicast anyway to active subscribers that include these ‘durability services’, so there is no induced overhead for this replication.)


@hansvanthag

Thanks for enlightening me with these insights, that helps a lot.


Hello,

Currently the only OMG standard transport for DDS/RTPS is UDP.
We are working on standardizing both a TCP and a TSN transport.

Various DDS vendors (RTI included) support RTPS over shared memory. So far the demand for standardizing this has not been so great, because to use shared memory the different processes must be on the same computer, and in this situation it is less likely to find multiple implementations of DDS being deployed. Of course, if this situation changed we could consider standardizing a shared memory transport…

Note that even if multiple DDS implementations using non-standard shared memory transports were deployed on the same computer node, it would not break wire compatibility. This is because the DDS/RTPS protocols are prepared to have multiple transports associated with every application (and even every DataWriter/DataReader). They automatically identify which transports are common to each DataWriter/DataReader pair and use those for communication. So if one implementation is using a vendor-specific shared memory transport, the other would recognize that and use one of the standard transports to communicate with that application. Of course, for that one communication path you would lose the performance advantage of using shared memory.
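
As a purely illustrative sketch of that "multiple transports per participant" idea, this is roughly how the built-in shared-memory and UDPv4 transports are enabled side by side with RTI Connext's classic C++ API (names may vary between versions, and error handling is omitted):

```cpp
// Sketch: enabling both built-in transports on a DomainParticipant with
// RTI Connext's classic C++ API (header: ndds/ndds_cpp.h; error handling omitted).
DDS_DomainParticipantQos qos;
DDSTheParticipantFactory->get_default_participant_qos(qos);

// Co-located endpoints can match over SHMEM; everything else falls back to
// the standard UDPv4 transport, preserving wire compatibility.
qos.transport_builtin.mask = DDS_TRANSPORTBUILTIN_SHMEM | DDS_TRANSPORTBUILTIN_UDPv4;

DDSDomainParticipant* participant =
    DDSTheParticipantFactory->create_participant(
        0 /* domain id */, qos, NULL /* listener */, DDS_STATUS_MASK_NONE);
```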

Gerardo


@GerardoPardo

understood, thanks for the intel.

tomoya

I am a little late to this discussion, and there is a lot of interesting and concrete information present already, but I would nonetheless like to add some thoughts on what might be an attractive alternative.

Eclipse Cyclone DDS today gives you a pretty decent small message latency over loopback (on a 3.5GHz Xeon E3-1270 running Ubuntu 16 - most likely the same configuration that @hansvanthag used above), as shown below. If you compare it to the various numbers shown in earlier comments, squinting a bit to allow for different measurement environments, you see it comes out rather well for the small ones, even though it uses the loopback interface, and fares not so badly against the latencies mentioned earlier using a shared memory transport for the large ones.

size       latency [0]
(bytes!)   (microseconds)
-------------------------
      4 [1]     10
     16         11
    128         11
   1024         13
   4096         16 [2]
  16384         28
  32768         44
  60000         67
 100000        103
 200000        202
 500000        473
1000000       1088

[0] median latency, measured by doing round-trips as fast as possible and halving the measured round-trip time, using reliable communications (a small timing sketch follows after these notes). I don’t have ROS2 available on real hardware, so I can’t perform the same test over ROS2, even though Cyclone has an RMW implementation;
[1] keyless topic of one int32_t, the others have an int32_t key field
(always set to 0) and an octet sequence;
[2] this is with a fragmenting threshold set large enough to not fragment the ones that fit in a single UDP datagram; with the default setting the numbers are worse.
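
Just to make the round-trip arithmetic in [0] explicit, here is a minimal, middleware-agnostic sketch; the `do_round_trip` callback is a stand-in for the actual reliable ping/pong over whatever transport you measure:

```cpp
// Minimal sketch of the measurement described in [0]: ping as fast as
// possible, halve each measured round-trip time, and report the median.
#include <algorithm>
#include <chrono>
#include <vector>

double median_one_way_latency_us(int samples, void (*do_round_trip)())
{
  std::vector<double> one_way_us;
  one_way_us.reserve(samples);
  for (int i = 0; i < samples; ++i)
  {
    const auto t0 = std::chrono::steady_clock::now();
    do_round_trip();  // send the ping, block until the pong arrives
    const auto t1 = std::chrono::steady_clock::now();
    one_way_us.push_back(
      std::chrono::duration<double, std::micro>(t1 - t0).count() / 2.0);
  }
  std::sort(one_way_us.begin(), one_way_us.end());
  return one_way_us[one_way_us.size() / 2];
}
```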

A shared memory transport would in all likelihood end up with a slight improvement over the 10us small message latency + however long copying the data takes. Judging by the number of copies involved, that’d be a bit worse than OpenSplice using shared memory from the table above.

Fortunately, there are more interesting options as well. Cyclone allows custom sample representations that manage their memory however they see fit, and the Cyclone RMW implementation uses that to transform directly between the ROS2 memory representation and CDR. That flexibility can easily be taken a step further: instead of serialising to CDR, it could just as easily copy the sample into shared memory, and then pass a (smart) pointer in a regular protocol message via the loopback interface.
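
Purely to illustrate that last idea (this is not Cyclone's actual API, and every name below is hypothetical), the 'pointer' passed over loopback could be a small descriptor, with the payload living in a shared-memory segment that both processes have mapped:

```cpp
// Hypothetical descriptor exchanged in the regular protocol message instead of
// the CDR-serialised payload (illustrative only; not Cyclone's actual API).
#include <cstdint>

struct shm_sample_ref
{
  std::uint64_t segment_id;   // identifies the shared-memory segment both sides mapped
  std::uint64_t offset;       // byte offset of the sample within that segment
  std::uint32_t size;         // payload size in bytes
  std::uint32_t refcount_idx; // slot of the reference counter that decides when the
                              // publisher may reuse/free this region
};

// A subscriber that has the same segment mapped resolves the reference locally;
// any other reader would still receive the normal CDR-serialised sample.
inline const void* resolve(const void* segment_base, const shm_sample_ref& ref)
{
  return static_cast<const std::uint8_t*>(segment_base) + ref.offset;
}
```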

Yes, there are some complications: for example, you’d have to copy a vtable into private memory residing at the same address in each process, but such things are trivialities. You’d also have to figure out how the RMW level implementation were to know whether it should just pass a pointer (which’d work for other processes attached to the same shared memory), or pass the CDR (for all other processes and nodes). That’s probably a bit less than trivial.

I would wager that solving those problems is a more interesting exercise than doing yet another RMW layer, and that it would get you (on this ancient machine) a latency for 1 MB samples of about 100 µs. So it should be worthy of a proof-of-concept at least.

But as it is, other pressing obligations prevent me from doing it in the near future. So unless someone takes up the challenge, it’s no more than vaporware …
