ROS2 latency using different node setups

Oh I see, so by IDL you meant the OMG version used by DDS; I was confusing that with an IDL I thought was specific to ROS nodes (nodeIDL?).

Does this mean you can generate a ROS message file (.msg) from the DDS IDL? Or does this mean some intermediate files in C++ or Python that typically get created from the ROS message file?

I am just curious about the best way to organize a big system composed of ROS 2 nodes from various distributions and some pure DDS nodes. I am still not clear on whether it would be better to specify all of the data models for the system in the ROS .msg format and generate IDL files for the DDS nodes, or to use the IDL format and then back-generate all of the .msg files for each ROS distribution.

I am assuming that it would be nice to store copies of .msg files with ROS nodes and IDL files with DDS nodes to make it easy to reuse those nodes in other simpler systems that only use a single ROS distribution or only use pure DDS nodes. You could just specify the data model for each node in its own preferred format directly, but it seems like it would be easier to make sure all of the data models match in the hybrid system if a common general format is used and the other formats are generated automatically.

IDL typically refers to OMG IDL, which is used by DDS implementations. The .msg files are referred to as rosidl, ROS message files, ROS interface files, or something like that. We don't usually refer to them as just "idl".

nodeIDL isn’t a thing AFAIK, but maybe you mean NoDL (GitHub - ubuntu-robotics/nodl: CLI and parsing utilities for the ROS 2 NoDL), which is something very different from OMG IDL and rosidl.
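For what it’s worth, the direction ROS 2’s own tooling implements is .msg → IDL: the rosidl_adapter package converts each .msg file into an OMG IDL file at build time, and the code generators then work from that IDL. A rough sketch of the mapping (package and field names here are made up):

```
# Temperature.msg in a hypothetical package my_interfaces
float64 celsius
```

is converted to roughly:

```idl
module my_interfaces {
  module msg {
    struct Temperature {
      double celsius;
    };
  };
};
```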


Hi all and @urczf,

Interesting insights so far, and we appreciate the work you’ve done with us since then. We went ahead and reproduced your performance tests on our side, updating them both to include the new RMW Connext implementations we have been developing (rmw_connextdds and rmw_connextddsmicro) and to cover some multi-process scenarios. You can see some results in this blog post (including a video walkthrough).

We tend to recommend test scenarios that are a little different from the basic "single process" one in the paper. This is to better approximate our users’ deployments. The tests I’m used to seeing have communication nodes running in separate processes at a minimum, if not on different machines.
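As a concrete illustration of the difference (a sketch assuming the standard ROS 2 demo packages are installed), the first pair of commands runs the nodes in separate processes, while the second composes both into a single container process:

```bash
# Separate processes: messages cross the RMW/DDS boundary
ros2 run demo_nodes_cpp talker &
ros2 run demo_nodes_cpp listener

# Single process: load both nodes into one component container
ros2 run rclcpp_components component_container &
ros2 component load /ComponentManager composition composition::Talker
ros2 component load /ComponentManager composition composition::Listener
```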

I was reading through the thread here - @carlossv pointed out the same in his long post (thanks carlos!), and @urczf - you pointed out that a clever application could figure out it’s running in the same process and just bypass the network layer entirely. At least, that’s how I would do it too. Connext supports some of these methods (e.g. zero-copy shared memory), but they are currently not accessible through the standard RMW APIs.

Beyond this, the main comment we wanted to make is that the poor results obtained with the current rmw_connext_cpp are not caused by Connext DDS. Rather, they represent known inefficiencies (extra memory allocations and copies) introduced in the RMW layer itself. We had already validated this observation by contributing a Connext DDS Communicator to the Apex.ai test and comparing the results of that with those using the rmw_connext_cpp.

It is for these reasons that RTI has contributed two new RMW Connext implementations: rmw_connextdds and rmw_connextddsmicro. These have much better performance and are in the process of being included in the ROS2 nightly builds. Our new RMW code is here.

Please feel free to reach out to me or just post here if you have any questions, I’ll help out as best as I can.

Enjoy!

Mark

Yes, we know, and that’s what we pointed out in the paper.

We did some evaluation and indeed, there is a huge difference between evaluating the latency of the nodes in one process and in multiple processes. Therefore, we will re-evaluate, this time with the nodes in different processes.

The internal behavior of the DDS middlewares, namely bypassing the OS network stack and performing intra-process communication, also occurs if the nodes are started in different contexts (which means they are mapped to different DomainParticipants). Apparently, the only thing that counts is whether they are in the same process or not.
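To make that setup concrete, here is a minimal sketch (node names are placeholders) of two nodes in one process but in separate rclcpp contexts, which under the default mapping gives each its own DomainParticipant:

```cpp
#include <memory>
#include <rclcpp/rclcpp.hpp>

int main(int argc, char ** argv)
{
  // Two independent contexts in the same process; each context is
  // backed by its own DDS DomainParticipant by default.
  auto context_a = std::make_shared<rclcpp::Context>();
  auto context_b = std::make_shared<rclcpp::Context>();
  context_a->init(argc, argv);
  context_b->init(argc, argv);

  auto node_a = std::make_shared<rclcpp::Node>(
    "talker", rclcpp::NodeOptions().context(context_a));
  auto node_b = std::make_shared<rclcpp::Node>(
    "listener", rclcpp::NodeOptions().context(context_b));

  // Even with separate participants, the DDS implementation can still
  // detect that both nodes share a process and short-circuit the OS
  // network stack for traffic between them.

  rclcpp::shutdown(context_a);
  rclcpp::shutdown(context_b);
  return 0;
}
```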

We’ll get back to you as soon as we have the new results. I will give the blog post a read, thank you @Mark_Hary!

Hi Mark,

Thank you for the interesting blog post; it’s great to see the improvements in Connext DDS.
I would like to point out a few items though.

I agree that test scenarios should be as diverse as possible, but the single-process use case should be as important as the other ones.
Neglecting it means ignoring the world of robotics applications running on embedded platforms with constrained resources (e.g. Raspberry Pi), which are currently the ones that most desperately need performance improvements.

This scenario is definitely less challenging (as it does not require serialization, etc.), but still the performance of all the different RMW implementations is not comparable with that of a single-process application that does not use pub-sub (or of a ROS application that uses the experimental intra-process communication feature of the rclcpp layer; see the sketch below).
I think that this is a barrier that makes it difficult for people who already have a working robotics application to transition to ROS.
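For reference, a minimal sketch of that rclcpp intra-process path (node and topic names are made up); publishing a unique_ptr lets ownership move to the subscription without a copy:

```cpp
#include <memory>
#include <utility>
#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/string.hpp>

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  // Opt in to rclcpp's intra-process communication for this node.
  auto node = std::make_shared<rclcpp::Node>(
    "ipc_demo", rclcpp::NodeOptions().use_intra_process_comms(true));

  auto pub = node->create_publisher<std_msgs::msg::String>("chatter", 10);
  auto sub = node->create_subscription<std_msgs::msg::String>(
    "chatter", 10,
    [](std_msgs::msg::String::UniquePtr msg) {
      RCLCPP_INFO(rclcpp::get_logger("ipc_demo"), "got: %s", msg->data.c_str());
    });

  auto msg = std::make_unique<std_msgs::msg::String>();
  msg->data = "hello";
  pub->publish(std::move(msg));  // ownership moves; no serialization or copy

  rclcpp::spin_some(node);  // delivers the buffered intra-process message
  rclcpp::shutdown();
  return 0;
}
```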

In the blog post it is mentioned that it is possible to send pointers in the messages rather than the real data, but that would only work in a single-machine use case, i.e. it would not allow debugging and introspection from your computer or communication with another robot.

Unfortunately, from a ROS user’s point of view I can’t really see the difference between inefficiencies in the DDS implementation and inefficiencies in the RMW layer.
If I have a ROS application, I can only evaluate those two pieces as a whole, and practically the only knob I can turn is the environment variable that selects which RMW implementation to use (fastdds, cyclonedds, connext, etc.).
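Concretely, that knob looks like this (assuming the corresponding RMW packages are installed):

```bash
# Select the RMW implementation per process via an environment variable
RMW_IMPLEMENTATION=rmw_fastrtps_cpp   ros2 run demo_nodes_cpp talker
RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ros2 run demo_nodes_cpp talker
RMW_IMPLEMENTATION=rmw_connextdds     ros2 run demo_nodes_cpp talker
```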

There have been discussions on having a single RMW layer, shared by all DDS implementations.
Maybe going down this road it would be possible to have one well-implemented library rather than multiple ones, each of which may have known inefficiencies, missing features, or other issues.

He refers to the paper, where we launched multiple nodes in the same process without using intra-process communication. I suppose you refer to the case where IPC is actually being used. I agree with you on that.

Further, the scope of the tests is different, imho. This blog post analyzes "OOTB" (out-of-the-box) behavior (which probably 90% of ROS2 users follow), i.e. writing multiple nodes without tweaking much. Whether this is the correct situation to analyze is indeed worth a discussion. However, I think there will be no correct answer to that.

In addition, these results can (more or less) be qualitatively mapped to a multi-machine use case if you only want to know how scalable a ROS2 system is in general, i.e. if you want to measure the latency entailed by a ROS2 system (OOTB) spanning multiple agents.

100% agree.

I suppose they are referring to the profiling in the ROS2 paper, and it’s nice for the user to know that it’s not Connext’s fault if latencies are bad, but the RMW maintainer’s fault (the experience remains the same though…).

Nice read. Thanks for sharing.

Yeah, I know what you mean about the RMW/DDS layers being opaque to the developer, and we don’t want you to have to profile between the two. Jacob @ Open Robotics did the community a great service in putting the Connext RMW layer together in the first place. You are 100% right about the inseparable experience - so we’ve decided it is our responsibility to support the community with an interface to Connext :grinning:. Let me know if you see any other performance issues and I’ll get it in the queue.

re: single process applications (and raspis) … try out the new layer and let me know what you think. It’s not as fast as passing a pointer or sharing a memory space, but it should be much more reasonable now.

Regarding the mw part of rmw - that’s part of the fun! In all seriousness though, this is part of the baggage that needs to be managed from having one standard (ROS) built on top of another standard (DDS). At some point, given convergence, you’d just have duplicate standards, or one would be a complete subset of the other. I’m new enough to this and don’t have an answer, but we’ll figure it out @ the rmw wg meetings.

M

@Mark_Hary, from the blog post you cited above, could you expand a little more on the last sentence within this paragraph, and how the current ROS RMW API limits such mapping? Is this rectifiable?

Incidentally, Connext DDS also supports a Zero Copy mode that can get the increased performance when sending large data inside a Process or over shared memory. Moreover, the Zero Copy approach used by Connext does not "bypass" the serialization and protocol and therefore does not cause the undesirable "coupling" side-effects. However, the Connext Zero Copy utilizes a specialized API that cannot be mapped to the ROS RMW API.

@ruffsl Strike "cannot" and replace it with "is not" in that statement.

It may be rectifiable; we’ll have to think more about it. We’re focusing on better OOTB behavior so that it will just work for you all w/out much thought. I’m guessing that’s what you really want :).

We’ll keep on posting as our rmw evolves, and of course, we’ll keep an eye on discussions as to what you want/need.

M

Hi,

Given the feedback from Apex.ai, we updated our methodology (basically spawning the nodes in separate processes and afterwards benchmarking the latency) and used the 95th percentile for evaluation. The results can be found here: https://arxiv.org/pdf/2101.02074v3.pdf
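As a side note, a small sketch of the percentile step (assuming one latency sample per message; the nearest-rank method here is my choice, not necessarily the exact estimator used in the paper):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Return the 95th-percentile latency via the nearest-rank method:
// the smallest sample with at least 95% of values at or below it.
// Assumes samples is non-empty.
double percentile95(std::vector<double> samples)
{
  std::sort(samples.begin(), samples.end());
  const std::size_t rank = static_cast<std::size_t>(
    std::ceil(0.95 * static_cast<double>(samples.size())));
  return samples.at(rank - 1);
}
```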


Hi @urczf

Impressive effort, and I am going to highlight the results, quoting your article:

"In the case of desktop PC (…), FastRTPS is slighty faster than Connext, whereas CycloneDDS entails the highest latency."

"for the Raspberry Pi (…), We can observe that Connext and CycloneDDS yield comparable results, whereas FastRTPS is significantly faster"