Is there any DDS implementation that uses shared memory within a system?

I am a little late to this discussion, and there is a lot of interesting and concrete information here already, but I would nonetheless like to add some thoughts on what might be an attractive alternative.

Eclipse Cyclone DDS today gives you pretty decent small-message latency over loopback (on a 3.5 GHz Xeon E3-1270 running Ubuntu 16, most likely the same configuration that @hansvanthag used above), as shown below. If you compare it to the various numbers in earlier comments, squinting a bit to allow for different measurement environments, it comes out rather well for the small messages even though it uses the loopback interface, and it fares not so badly against the shared-memory-transport latencies mentioned earlier for the large ones.

size       latency [0]
(bytes)    (microseconds)
-------------------------
      4 [1]     10
     16         11
    128         11
   1024         13
   4096         16 [2]
  16384         28
  32768         44
  60000         67
 100000        103
 200000        202
 500000        473
1000000       1088

[0] median latency, measured by doing round-trips as fast as possible and halving the measured round-trip time, using reliable communications. I don’t have ROS2 available on real hardware, so I can’t perform the same test over ROS2, even though Cyclone has an RMW implementation;
[1] keyless topic of one int32_t; the others have an int32_t key field (always set to 0) and an octet sequence;
[2] this is with the fragmentation threshold set large enough that samples fitting in a single UDP datagram are not fragmented; with the default setting the numbers are worse.
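For concreteness, the halved-round-trip estimate described in [0] boils down to something like the sketch below. This is not the actual Cyclone benchmark code; the function name and the idea of collecting round-trip times into a vector are my own illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Estimate one-way latency from a batch of measured round-trip times:
// take the median round trip and halve it, as described in note [0].
double one_way_latency_us(std::vector<double> rtts_us) {
    assert(!rtts_us.empty());
    auto mid = rtts_us.begin() + rtts_us.size() / 2;
    // nth_element partially sorts so that *mid is the median element.
    std::nth_element(rtts_us.begin(), mid, rtts_us.end());
    return *mid / 2.0;
}
```

Taking the median rather than the mean keeps the estimate robust against the occasional scheduling hiccup inflating a single round trip.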

A shared memory transport would in all likelihood end up slightly better than the 10 µs small-message latency, plus however long copying the data takes. Judging by the number of copies involved, that’d be a bit worse than the OpenSplice shared-memory figures from the table above.

Fortunately, there are more interesting options as well. Cyclone allows custom sample representations that manage their memory however they see fit, and the Cyclone RMW implementation uses that to transform directly between the ROS2 memory representation and CDR. That flexibility can easily be taken a step further: instead of serialising to CDR, it could just as easily copy the sample into shared memory, and then pass a (smart) pointer in a regular protocol message via the loopback interface.

Yes, there are some complications: for example, you’d have to copy a vtable into private memory residing at the same address in each process, but such things are trivialities. You’d also have to figure out how the RMW-level implementation is to know whether it should just pass a pointer (which would work for other processes attached to the same shared memory) or pass the CDR (for all other processes and nodes). That’s probably a bit less than trivial.
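That last routing decision might reduce to a per-reader dispatch along these lines. The `reader_locator` type and the idea of a segment identifier in the discovery data are assumptions of mine, not anything Cyclone or the RMW layer currently provides:

```cpp
#include <cstdint>

// Invented discovery datum: which shared-memory segment, if any,
// a matched reader is attached to (0 = not attached to any).
struct reader_locator {
    uint64_t shm_segment_id;
};

enum class delivery { shm_pointer, cdr };

// Pass a pointer only to readers attached to the writer's own segment;
// everyone else gets the regular CDR serialization over the network.
delivery choose_delivery(const reader_locator& r, uint64_t writer_segment_id) {
    return (r.shm_segment_id != 0 && r.shm_segment_id == writer_segment_id)
               ? delivery::shm_pointer
               : delivery::cdr;
}
```

The hard part, of course, is not this dispatch but getting the segment identity reliably into the discovery data in the first place.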

I would wager that solving those problems is a more interesting exercise than doing yet another RMW layer, and that it would get you (on this ancient machine) a latency for 1 MB samples of about 100 µs. So it should be worthy of a proof-of-concept at least.

But as it is, other pressing obligations prevent me from doing it in the near future. So unless someone takes up the challenge, it’s no more than vaporware …
