I have been using various nodes in a project, some written in C++ and some in Python. Even for basic nodes that just do some publishing, I noticed a huge difference in CPU usage between Python and C++: a Python node generally uses several times more CPU than a C++ node doing the same thing.
This happens for publisher nodes as well as for server nodes, even in their idle state when no services or actions are actually being requested, which I find extremely strange…
Did anyone notice similar issues? Does anyone have any recommendations for reducing the CPU usage of Python nodes?
The issue has been observed so far in ROS2 Galactic, under different CPU architectures (both AMD and ARM).
Every single time I have observed that, it was a busy loop that could easily have been avoided (a while loop without any sleep or blocking operation).
There was always a simple solution. If you share the code of your main loop, I can give you some hints.
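To make the pattern concrete, here is a minimal plain-Python sketch (no rclpy; the function names are illustrative) of the busy loop versus a blocking wait. In an rclpy node the blocking role is usually played by `rclpy.spin(node)` or a `rate.sleep()` inside the loop:

```python
import threading

done = threading.Event()

def busy_wait():
    # Anti-pattern: this loop wakes constantly and pins ~100% of a core
    # while doing nothing, because it never sleeps or blocks.
    while not done.is_set():
        pass

def blocking_wait():
    # Fix: block in the kernel until the event fires; idle CPU is ~0%.
    done.wait()
```

In an actual rclpy node, the equivalent fix is typically to let `rclpy.spin()` drive the callbacks, or to call something like `node.create_rate(hz).sleep()` inside any hand-written loop, rather than spinning on a condition.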
At the risk of sounding callous, you guys do realize you’re comparing a compiled language known for its memory efficiency and speed to one that’s none of those three things. The virtual machine adds overhead, so does the interpreter, and of course the language itself needs more processor cycles to run, given dynamic typing and other high-level abstractions. Even if you’re not receiving anything, there’s still a thread in the background that needs to periodically poll for new socket data, I presume. Unlike C++, which will strip out all unused code at compile time, Python also needs to keep all imported modules loaded in memory.
But none of us are using Python because it’s fast or memory efficient, we’re using it because it cuts down development time by an order of magnitude
I would have to check again, but I have noticed this same issue with the CLI tools such as ros2 topic or ros2 bag record
that are implemented in Python. I believe one basic issue is that the executor used in this setting has a busy loop (as @facontidavide mentioned) that can eat up a whole CPU even if there is little actual work to be done. I don’t think it’s easy to just get rid of, though.
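As a rough analogy for why a polling executor is expensive when idle, here is a plain-Python sketch (this is not the actual rclpy executor code, just an illustration of polling versus blocking on a work queue):

```python
import queue

work = queue.Queue()  # stand-in for incoming messages / callbacks

def polling_spin(stop):
    # Analogous to spinning with a zero timeout in a loop: the loop
    # wakes constantly even when there is nothing to do, so an idle
    # process still burns a whole core.
    while not stop.is_set():
        try:
            work.get_nowait()()
        except queue.Empty:
            pass  # loop again immediately -> busy wait

def blocking_spin(stop):
    # Blocking on the queue (with a timeout so shutdown is possible)
    # leaves the process asleep in the kernel while idle.
    while not stop.is_set():
        try:
            work.get(timeout=0.1)()
        except queue.Empty:
            continue
```

The observable difference is the same one reported in this thread: both variants process the same work, but the polling one shows high CPU usage even when the queue is empty.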
Ah yeah, once something caps out a core you’re either doing something wrong or processing more of something than you have the capacity for, and it may be time to multithread. If that’s really the case then it’s surely a problem.
But in my experience the rule of thumb for non-busy code is that if something takes 1-2% of CPU in C++, it’ll likely still be around 9-15% in Python. Usually that trade-off is still acceptable.
Sorry for the late reply.
Thanks a lot everyone for all the great feedback!
I see there are contrasting opinions on the topic, so I started to analyze things a bit more in detail on our code base.
I found out, surprisingly, that the high CPU usage I had noticed appears only in our testing pipelines with launch_testing.
There I have e.g. an idle action server node that listens for action goals and does nothing else, and that specific process shows a more or less constant CPU usage of about 40% of a core (this huge load for an idle process is the main reason why I started investigating potential performance issues with rclpy).
I now tried to analyze the same idle node when run individually via ros2 run, and the CPU load of that same process is instead close to 0%.
As a next step I will look further into whether the cause is launch_testing, something in our pipelines, or somehow the effect of the other nodes that are launched in parallel in our tests.
I’ll update you on these findings as soon as I get some results, and if it is still unclear I’ll also share the code of the problematic nodes with you.
Might take a few days, since in parallel I am also working on several other topics, so I apologize in advance for the wait.
Although far from 100%, we have seen high CPU usage by nodes that don’t do much. For example a simple node that just subscribes to a high-frequency topic (e.g. /clock in simulation) can consume about 10-15% of a single CPU… I would need to profile more, and on a recent distribution (this was Foxy), but I suspect that there are quite a lot of inefficiencies in the executor and / or the deserialization of messages.
Serialization and deserialization of non-primitive elements (such as ROS time) in ROS2 is slow in C++, and can be terribly slow in Python. Because it involves deserialization, the simple command “ros2 topic hz” can choke up so badly that it is close to useless for measuring the actual frequency at which a topic is published, at least for high-frequency messages containing non-primitive types. This sad fact is re-learned by every ROS2 newbie (like me). Please see below link for more.
After some more investigation I found that it really is the Gazebo simulation that causes the peak CPU usage in Python nodes.
As @haudren pointed out, this really seems to be caused by Python nodes listening to the sim_time coming from Gazebo.
In our setup in particular, we had set the Gazebo parameters to a publish_rate of 1000Hz. This is mainly because some C++ nodes work at high frequency, and we need the clock resolution to be on the order of milliseconds.
The C++ nodes do not seem to suffer noticeably from this high-frequency messaging, while the Python nodes really go crazy.
I tested this with an empty idle action server node and an empty gazebo world, and these are the results using different publish rates for gazebo:
publish_rate → action server node CPU load
1000Hz → ~50%
100Hz → ~12%
10Hz → ~2%
1Hz → 0%
(tested on a laptop with an Intel i7 2.60GHz CPU)
This is really terrible handling of the simulation time. I find it crazy that deserializing something as basic as a Time message would take this much CPU…
Does anyone have any suggestions for dealing with this situation?
Our C++ nodes still need a high clock update rate. We could of course port all our python nodes to C++, but it is normally very convenient to have some high level stuff in python for quick development iterations…
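One partial workaround, if the Python side only needs a coarse clock, is to decimate the work done per tick. This is only a sketch with hypothetical names (`Decimator`, `handle_clock`), and it comes with a big caveat: rclpy still deserializes every incoming message before invoking the callback, so decimation only trims your own callback work. The more effective fixes remain lowering the publish_rate for Python consumers, or relaying /clock at a reduced rate from a small C++ node.

```python
class Decimator:
    """Forward only every Nth message to an expensive callback.

    Illustrative sketch, not part of rclpy: deserialization of each
    incoming message still happens upstream of this wrapper.
    """

    def __init__(self, factor, callback):
        self.factor = factor
        self.callback = callback
        self._count = 0

    def __call__(self, msg):
        self._count += 1
        if self._count % self.factor == 0:
            self.callback(msg)
```

Assuming a C++ relay is not an option, you would wrap the real callback when subscribing, e.g. something like `node.create_subscription(Clock, '/clock', Decimator(10, handle_clock), 10)`, which would run `handle_clock` at roughly 100Hz from a 1000Hz clock topic.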