Currently, I’m trying to create a composition-based navigation package for embedded-systems users who need to optimize for harsh resource constraints (i.e. compose all Nav2 nodes in a single process instead of launching them as separate processes).
First, I create multiple nodes in a single main() (manual composition).
Then, I create a multi-threaded executor to spin these nodes.
In addition, I use option.use_intra_process_comms(true) to get efficient intra-process communication.
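For concreteness, here is a minimal sketch of that pattern (the two component classes are just placeholders standing in for the real Nav2 servers, not actual Nav2 code):

```cpp
#include <memory>

#include "rclcpp/rclcpp.hpp"

// Placeholder component nodes standing in for the real Nav2 servers
// (planner server, controller server, ...); the real classes take
// rclcpp::NodeOptions in the same way.
class PlannerLikeNode : public rclcpp::Node
{
public:
  explicit PlannerLikeNode(const rclcpp::NodeOptions & options)
  : rclcpp::Node("planner_like_node", options) {}
};

class ControllerLikeNode : public rclcpp::Node
{
public:
  explicit ControllerLikeNode(const rclcpp::NodeOptions & options)
  : rclcpp::Node("controller_like_node", options) {}
};

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);

  // One set of options shared by all composed nodes: enable rclcpp
  // intra-process comms so messages between them avoid the DDS layer.
  rclcpp::NodeOptions options;
  options.use_intra_process_comms(true);

  auto planner = std::make_shared<PlannerLikeNode>(options);
  auto controller = std::make_shared<ControllerLikeNode>(options);

  // One multi-threaded executor spins every node in this single process.
  rclcpp::executors::MultiThreadedExecutor executor;
  executor.add_node(planner);
  executor.add_node(controller);
  executor.spin();

  rclcpp::shutdown();
  return 0;
}
```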
There were some problems when I tried to implement this.
The first problem: some QoS settings aren’t supported (for example transient_local), so nodes that use transient_local couldn’t enable intra-process communication.
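For context, here is a minimal sketch of the kind of combination that gets rejected: a transient_local publisher created on a node that has intra-process comms enabled (node and topic names are made up; in recent rclcpp this is expected to throw, since intra-process currently requires volatile durability):

```cpp
#include <exception>
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/string.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);

  // Node with rclcpp intra-process comms enabled.
  auto node = std::make_shared<rclcpp::Node>(
    "map_like_node", rclcpp::NodeOptions().use_intra_process_comms(true));

  // transient_local durability (a latched-style topic, as used e.g. for maps)
  // is not supported by rclcpp intra-process comms, so creating this
  // publisher is expected to throw in recent rclcpp versions.
  auto qos = rclcpp::QoS(rclcpp::KeepLast(1)).transient_local();
  try {
    auto pub = node->create_publisher<std_msgs::msg::String>("map_like_topic", qos);
  } catch (const std::exception & e) {
    RCLCPP_ERROR(node->get_logger(), "Failed to create publisher: %s", e.what());
  }

  rclcpp::shutdown();
  return 0;
}
```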
The second problem: CPU usage is higher (roughly a 50% increase), which confuses me.
For the first problem, I noticed that rclcpp’s intra-process comms are still somewhat under development; you can find more details here:
For the second problem, I compared two approaches:
Manual composition with multiple single-threaded executors: use a set of single-threaded executors, one per node.
Manual composition with a multi-threaded executor: use one large multi-threaded executor for all the nodes.
Clearly, the single large multi-threaded executor consumes more CPU (a 30%-50% increase) than the set of single-threaded executors, and I don’t know why; any discussion is welcome!
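For reference, the first approach looks roughly like this (a minimal sketch with placeholder nodes: one single-threaded executor per node, each spun on its own thread):

```cpp
#include <memory>
#include <thread>
#include <vector>

#include "rclcpp/rclcpp.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);

  // Placeholder nodes; in the real bringup these would be the Nav2 servers.
  std::vector<std::shared_ptr<rclcpp::Node>> nodes = {
    std::make_shared<rclcpp::Node>("node_a"),
    std::make_shared<rclcpp::Node>("node_b"),
    std::make_shared<rclcpp::Node>("node_c"),
  };

  // One single-threaded executor per node, each spun on its own thread,
  // so every node's callbacks are handled by its own executor.
  std::vector<std::shared_ptr<rclcpp::executors::SingleThreadedExecutor>> executors;
  std::vector<std::thread> threads;
  for (auto & node : nodes) {
    auto exec = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
    exec->add_node(node);
    executors.push_back(exec);
    threads.emplace_back([exec]() {exec->spin();});
  }

  for (auto & t : threads) {
    t.join();
  }

  rclcpp::shutdown();
  return 0;
}
```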
Hello,
I think it is a very nice project. A couple of suggestions/comments:
Include a DDS comparison. In my experience, which middleware is used has a major impact on performance, and I believe some vendors have not implemented all the IPC features yet.
Mention the ROS version/distribution used (Foxy, Galactic, Rolling? / Ubuntu?).
Firstly, great to see you are working on making Nav2 composable! That’s really important in many use cases, and, as I would have expected, your measurements show that RAM usage is reduced a lot.
Secondly, before speculating on what could cause higher CPU usage, it would be important to know more about your measurement approach:
Which threads are you measuring?
How are you aggregating the data temporally to come up with the average?
I always recommend making a plot of resource usage, to see whether the system has steady-state behavior or whether resource usage fluctuates.
Last but not least, it would be interesting to understand the workload: is this in simulation, is it repeatable, etc.?
One thing about changing the communication layer is that, when your system is IO-bound, lowering latency can lead to more computation being done, because data arrives sooner. That would drive up overall CPU usage!
However, with recent DDS implementations, there is already a lot of optimization done “under the hood”. For example, FastRTPS already uses in-process communication when it detects that the endpoints are in the same process, and it already uses shared memory when it detects that the endpoints are on the same host. Maybe CycloneDDS does something similar. With those automatic optimizations, I would expect the gains from rclcpp-based intra-process comms to be fairly small; though I would certainly not expect CPU usage to go up!
Thanks for your suggestions. I will run some experiments using different DDS vendors, and the examples you mentioned about ROS 2 composition are helpful for me.
In my opinion, rclcpp intra-process comms are still under development, so for now the best option for users is to leave them disabled (the default in ROS 2), because of problems such as the QoS limitations.
Also, Cyclone DDS (the default for ROS 2 Galactic) supports IPC, which could improve performance as the table shows, but I also agree with you that some vendors have not implemented all the IPC features yet.
Unfortunately, this is not the case. Despite the improvements in the DDS layer, there are still big differences between enabling and disabling rclcpp intra-process comms.
Consider a test where you have only one process, with an arbitrary number of nodes, all of them using rclcpp intra-process: enabling it makes the system approximately twice as fast, with about half the CPU usage, compared with the default (i.e. disabled).
See some recent data from this page: High CPU usage when using DDS intra-process communication · Issue #1642 · ros2/rclcpp · GitHub
The situation changes drastically if not all the entities in the system have rclcpp-intra-process enabled.
For example, if you have a publisher and two subscribers, with one subscription using rclcpp intra-process and one that does not, you get a lot of extra work: the publisher needs to make additional copies of the message, and the DDS layer will also deliver messages to the subscriber that has rclcpp intra-process enabled, thus doing double the amount of work (note that this subscriber will discard the received message because it will have already received it via rclcpp intra-process).
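A minimal sketch of that mixed situation (assuming rclcpp’s per-subscription IntraProcessSetting; node and topic names are made up):

```cpp
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/string.hpp"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);

  // Node with rclcpp intra-process comms enabled as its default.
  auto node = std::make_shared<rclcpp::Node>(
    "mixed_ipc_node", rclcpp::NodeOptions().use_intra_process_comms(true));
  auto logger = node->get_logger();

  auto pub = node->create_publisher<std_msgs::msg::String>("chatter", 10);

  auto callback = [logger](std_msgs::msg::String::ConstSharedPtr msg) {
      RCLCPP_INFO(logger, "received: %s", msg->data.c_str());
    };

  // Subscription A: uses the node default, i.e. rclcpp intra-process.
  auto sub_intra = node->create_subscription<std_msgs::msg::String>(
    "chatter", 10, callback);

  // Subscription B: opts out of rclcpp intra-process, so it is served through
  // the DDS layer. The publisher now has to copy the message for the
  // intra-process path *and* publish it through DDS; subscription A discards
  // the DDS copy because it already received the intra-process one.
  rclcpp::SubscriptionOptions no_intra;
  no_intra.use_intra_process_comm = rclcpp::IntraProcessSetting::Disable;
  auto sub_dds = node->create_subscription<std_msgs::msg::String>(
    "chatter", 10, callback, no_intra);

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}
```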
These problems have always been present in ROS 2, so it’s not a bug recently introduced in Galactic.
I haven’t looked in detail at the whole system that you are running, but since you mention that some nodes couldn’t enable rclcpp intra-process due to unsupported QoS, I expect this was the problem.
We are currently looking at the best way to fix this.
@gzep would you mind running a test with all nodes using default QoS?
About the multi-threaded executor: I confirm that it always performs much worse than using a set of single-threaded executors.
My preference is to use pandas for handling the data. When you have your data in a pandas Series or DataFrame, you can just call “plot” on it. This works best in a Jupyter notebook, where the resulting image is displayed directly, but you can also save it to disk.
Under the hood, pandas by default uses matplotlib, though bokeh, plotly and a bunch of other plotting libraries are also supported. You can of course also use these directly.
By the way, I saw in your script that you are sampling the data every 2 s, i.e. at 0.5 Hz; for more insight, I would recommend sampling at 10 Hz or more.
@alsora thank you for that context; it is incredibly valuable to have our observed behavior confirmed. That solidifies our decision to use manual composition with single-threaded executors for the time being. It’s always hard to tell whether it’s a case of “did I miss something?!” or actual behavior.
We’ll need to wait for more QoS profiles to be usable with intra-process comms before we can enable that in the stack.
I now have a little bird chirping in my ear about potentially making the default bringup scheme for Nav2 a manually composed bringup, so that this behavior comes out of the box. However, since we still can’t do lifecycle components in launch, it would have to be manually composed.
Which would work fine; the main concern I have is around localization-system vendorization (e.g. AMCL, SLAM Toolbox, other systems) if everything is manually composed. Perhaps separate out the localization system but have the rest of Nav2 manually composed? What do you think? I think the rest of the servers / system are pretty standard for everyone’s use, and if someone doesn’t want one of them, they can always remove it from the list of nodes to transition up in the lifecycle manager, leaving very little overhead (though they should probably just create a new manual composition file with what they need in their company’s specific bringup package).
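To make that concrete, a very rough sketch of such a composed bringup could look like the following (placeholder component classes rather than the real Nav2 servers, a plain node standing in for the lifecycle manager, and localization left in its own process):

```cpp
#include <memory>
#include <string>
#include <thread>
#include <vector>

#include "rclcpp/rclcpp.hpp"

// Placeholder component classes; in a real bringup these would be the Nav2
// servers (planner, controller, behavior tree navigator, ...). Localization
// (AMCL, SLAM Toolbox, a vendor system, ...) keeps running in its own process.
class PlannerLikeNode : public rclcpp::Node
{
public:
  explicit PlannerLikeNode(const rclcpp::NodeOptions & options)
  : rclcpp::Node("planner_server", options) {}
};

class ControllerLikeNode : public rclcpp::Node
{
public:
  explicit ControllerLikeNode(const rclcpp::NodeOptions & options)
  : rclcpp::Node("controller_server", options) {}
};

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);

  rclcpp::NodeOptions options;
  std::vector<std::shared_ptr<rclcpp::Node>> nodes = {
    std::make_shared<PlannerLikeNode>(options),
    std::make_shared<ControllerLikeNode>(options),
  };

  // A plain rclcpp::Node stands in for the real lifecycle manager here.
  // Users who do not want one of the servers would drop it from this list
  // (and from the composition above), or create their own composition file
  // in their bringup package.
  nodes.push_back(std::make_shared<rclcpp::Node>(
      "lifecycle_manager_navigation",
      rclcpp::NodeOptions().parameter_overrides({
        rclcpp::Parameter("node_names",
          std::vector<std::string>{"planner_server", "controller_server"}),
        rclcpp::Parameter("autostart", true),
      })));

  // Spin each node on its own single-threaded executor, per the discussion above.
  std::vector<std::shared_ptr<rclcpp::executors::SingleThreadedExecutor>> executors;
  std::vector<std::thread> threads;
  for (auto & node : nodes) {
    auto exec = std::make_shared<rclcpp::executors::SingleThreadedExecutor>();
    exec->add_node(node);
    executors.push_back(exec);
    threads.emplace_back([exec]() {exec->spin();});
  }
  for (auto & t : threads) {
    t.join();
  }

  rclcpp::shutdown();
  return 0;
}
```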