High cpu load for simple python nodes

Marco · November 15, 2022, 5:43pm

In one project I have been using various nodes, some written in C++ and some in Python. I noticed a huge difference in CPU usage between the two, even for basic nodes that just do some publishing: a Python node generally uses several times more CPU than a C++ node doing the same work.
This happens for publisher nodes as well as for server nodes, even when they are idle and no services or actions are being requested, which I find extremely strange…

Did anyone notice similar issues? Does anyone have any recommendations for reducing the CPU usage of Python nodes?
So far the issue has been observed in ROS2 Galactic, on different CPU architectures (both amd and arm).

aposhian · November 15, 2022, 5:47pm

I have observed this as well. For this reason, I only use rclpy for development, and never for production nodes.

matthews-jca · November 15, 2022, 6:16pm

Not just CPU overhead when spinning, but memory footprint as well. Mirroring @aposhian, we usually prototype in Python and then convert to C++ for release.

tomoyafujita · November 15, 2022, 6:17pm

@Marco can you create an issue, with a description, on GitHub at ros2/rclpy (ROS Client Library for Python)?

Or maybe we already have a similar issue?

CC: @aposhian

facontidavide · November 15, 2022, 7:29pm

Every single time I observed that, it was a busy loop that could be easily avoided (while loop without any sleep or blocking operation).
There was always a simple solution to avoid that. If you share the code of the main loop, I can give you some hints
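
To make the busy-loop point concrete, here is a minimal sketch that uses only the Python standard library (no rclpy). The function names are made up for illustration. It compares the CPU time consumed by polling a flag in a tight loop with the CPU time consumed by blocking on the same flag.

```python
import threading
import time


def busy_wait(flag: threading.Event, timeout: float) -> float:
    """Poll the flag in a tight loop and return the CPU seconds consumed."""
    start_cpu = time.process_time()
    deadline = time.monotonic() + timeout
    while not flag.is_set() and time.monotonic() < deadline:
        pass  # no sleep and no blocking call: this keeps one core busy
    return time.process_time() - start_cpu


def blocking_wait(flag: threading.Event, timeout: float) -> float:
    """Block on the flag and return the CPU seconds consumed."""
    start_cpu = time.process_time()
    flag.wait(timeout)  # the thread sleeps until the flag is set or the timeout expires
    return time.process_time() - start_cpu


if __name__ == "__main__":
    never_set = threading.Event()
    print(f"busy loop:     {busy_wait(never_set, 0.3):.3f} s of CPU")
    print(f"blocking wait: {blocking_wait(never_set, 0.3):.3f} s of CPU")
```

Both functions wait for the same wall-clock time, but only the first one burns CPU while doing so. In a node, the equivalent fix is usually to replace a hand-written polling loop with a timer, a callback, or a blocking call.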

pateco · November 15, 2022, 8:39pm

Can you share the difference you see in CPU and memory usage when you run

docker run --rm -it osrf/ros:humble-desktop ros2 run demo_nodes_py talker

vs

docker run --rm -it osrf/ros:humble-desktop ros2 run demo_nodes_cpp talker
?

Katherine_Scott · November 15, 2022, 9:39pm

This would make a great ROS 2 Docs article.

MoffKalast · November 17, 2022, 4:44pm

At the risk of sounding callous, you guys do realize you’re comparing a compiled language known for its memory efficiency and speed to one that’s none of those three things. The virtual machine will add overhead, so will the interpreter, and of course the language itself needs more processor cycles to run, given dynamic typing and other high level abstractions. Even if you’re not receiving anything there’s still a thread in the background that needs to periodically poll for new socket data I presume. Unlike C++ which will strip out all unused code at compile time, Python also needs to keep all imported modules loaded into memory.

But none of us are using Python because it’s fast or memory efficient, we’re using it because it cuts down development time by an order of magnitude

Frederik_Beaujean · November 21, 2022, 8:34am

I would have to check again, but I have noticed this same issue with CLI tools that are implemented in Python, such as ros2 topic or ros2 bag record. I believe one basic issue is that the executor used in this setting has the busy loop (as @facontidavide mentioned), which can eat up a whole CPU even if there is little actual work to be done. I don’t think it’s easy to just get rid of, though.

facontidavide · November 21, 2022, 9:08am

At the risk of sounding callous, you guys do realize you’re comparing a compiled language known for its memory efficiency and speed to one that’s none of those three things.

@MoffKalast my first reaction was to think the same and say: “let’s convert this to C++, of course”.

But I was always wrong. If you observe a process that takes 100% of CPU and your intuition is telling you that should be below 10%, then your intuition is usually right and there is a busy loop.

This, at least, is my experience (and I am a hardcore C++ developer obsessed with optimization).

facontidavide · November 21, 2022, 9:09am

@Marco you started an animated conversation. Could you share your code or a similar example, to give us the opportunity to give you and other users proper advice?

RobotDreams · November 21, 2022, 4:05pm

I think that the “python node taking 100% of CPU” must not have a subscriber or timer callback to give spin a chance to block the node?

As for ros2 topic taking a lot of CPU: on my Galactic, ros2 topic echo /battery_state, for a topic that publishes every 10 seconds, takes around 1% of CPU, and only when it is active.

MoffKalast · November 21, 2022, 7:40pm

Ah yeah, once something caps out a core you’re either doing something wrong or processing more of something than you have the capacity to, and it may be time to multithread. If that’s really the case then it’s surely a problem.

But in my experience the rule of thumb for non-busy code is that if something takes 1-2% of CPU in C++ that’ll likely still be around 9-15% in python. Usually that trade-off is still acceptable.

peci1 · November 21, 2022, 10:05pm

It would be great if someone could run the problematic node with a profiler, like the one PyCharm Pro offers. That could shed some light on where the problem is…
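
If PyCharm Pro is not at hand, the standard library's cProfile is another option. The sketch below is only an illustration: `profile_call` and `demo_workload` are made-up names, and the workload is a stand-in for whatever the node actually does.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, text report of the top entries)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


def demo_workload(n):
    """Hypothetical stand-in for the code under investigation."""
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    _, report = profile_call(demo_workload, 100_000)
    print(report)
```

For a node, the same wrapper could be put around a bounded spin loop, for example a small helper that calls `rclpy.spin_once(node, timeout_sec=0.1)` a fixed number of times, so that the profiler stops cleanly and the report shows where the time goes.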

Jason_Beach · November 23, 2022, 2:32am

I have an Intel Core i7-7700HQ w/ 16GB memory, running Galactic, and measured both Python and C++.

So yes, Python is using more CPU (0.7% vs 0.0%), but nothing egregious. Same with memory.

Marco · November 23, 2022, 12:56pm

Sorry for the late reply.
Thanks a lot everyone for all the great feedback!
I see there are contrasting opinions on the topic, so I started to analyze things a bit more in detail on our code base.
I found out, surprisingly, that the high CPU usage I had noticed appears only in our testing pipelines with launch_testing.
There I have, for example, an idle action server node that listens for action goals and does nothing else, yet that process shows a more or less constant CPU usage of about 40% of a core. This huge load for an idle process is the main reason why I started investigating potential performance issues with rclpy.
I now tried to analyze the same idle node when run individually via ros2 run, and the CPU load of that same process is instead close to 0%.
As a next step I will look further into whether the cause is launch_testing, something in our pipelines, or somehow the effect of the other nodes that are launched in parallel in our tests.
I’ll update you on these findings as soon as I get some results, and if it is still unclear I’ll also share the code of the problematic nodes with you.
Might take a few days, since in parallel I am also working on several other topics, so I apologize in advance for the wait.

haudren · November 25, 2022, 12:39am

Although far from 100%, we have seen high CPU usage by nodes that don’t do much. For example a simple node that just subscribes to a high-frequency topic (e.g. /clock in simulation) can consume about 10-15% of a single CPU… I would need to profile more, and on a recent distribution (this was Foxy), but I suspect that there are quite a lot of inefficiencies in the executor and / or the deserialization of messages.

Bernd_Pfrommer · November 25, 2022, 1:37am

Serialization and deserialization of non-primitive elements (such as ROS time) in ROS2 is slow in C++, and can be terribly slow when using Python. Because it involves deserialization, the simple command “ros2 topic hz” can choke up so badly that it is close to useless for measuring the actual frequency at which a topic is published, at least for high frequency messages containing non-primitive types. This sad fact is re-learned by every ROS2 newbie (like me). Please see the link below for more.

github.com/ros2/rmw_cyclonedds

slow publishing and performance for custom messages with large arrays

opened 08:48PM - 14 Oct 21 UTC

closed 09:47AM - 16 Oct 21 UTC

berndpfrommer

## Bug report

**Required Info:**

- Operating System:
  - Ubuntu 20.04
- Installation type:
  - ROS2 galactic standard ubuntu package installed via apt
- Version or commit hash: This is what the apt package info says:
  ``ros-galactic-cyclonedds/focal,now 0.8.0-5focal.20210608.002038 amd64 [installed,automatic]``
- DDS implementation:
  - cyclonedds
- Client library (if applicable):
  - rclcpp
- Hardware: AMD Ryzen 7 4800H with 64GB of memory

#### Steps to reproduce issue

I have made a very small repo with the below demo code and instructions how to run: https://github.com/berndpfrommer/ros2_issues

Here is the source code for the publisher:

```
#include <unistd.h>

#include <rclcpp/rclcpp.hpp>
#include <ros2_issues/msg/test_array_complex.hpp>
#include <thread>

template <class MsgType>
struct TestPublisher : public rclcpp::Node
{
  explicit TestPublisher(const rclcpp::NodeOptions & options)
  : Node("test_publisher", options)
  {
    pub_ = create_publisher<MsgType>(
      "~/array", declare_parameter<int>("q_size", 1000));
    thread_ = std::thread([this]() {
      rclcpp::Rate rate(declare_parameter<int>("rate", 1000));
      const int numElements = declare_parameter<int>("num_elements", 100);
      rclcpp::Time t_start = now();
      size_t msg_cnt(0);
      const rclcpp::Duration logInterval = rclcpp::Duration::from_seconds(1.0);
      while (rclcpp::ok()) {
        MsgType msg;
        msg.header.stamp = now();
        msg.elements.resize(numElements);
        pub_->publish(msg);
        rate.sleep();
        msg_cnt++;
        rclcpp::Time t = now();
        const rclcpp::Duration dt = t - t_start;
        if (dt > logInterval) {
          RCLCPP_INFO(get_logger(), "pub rate: %8.2f", msg_cnt / dt.seconds());
          t_start = t;
          msg_cnt = 0;
        }
      }
    });
  }
  // -- variables
  typename rclcpp::Publisher<MsgType>::SharedPtr pub_;
  std::thread thread_;
};

int main(int argc, char * argv[])
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<TestPublisher<ros2_issues::msg::TestArrayComplex>>(
    rclcpp::NodeOptions());
  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}
```

It uses the following custom message for ``TestArrayComplex``:

```
std_msgs/Header header
# test array of elements
TestElement[] elements
```

and the TestElement of the array is defined as:

```
uint16 x
uint16 y
builtin_interfaces/Time ts
bool polarity
```

#### Expected behavior

Under ROS1 I can publish 1000 msgs/sec with 100,000 elements per message *and* receive at a rate of 1000Hz with ``rostopic hz``

#### Actual behavior

Under ROS2 (galactic), already the publishing fails to keep up at a message size of 5,000 elements. Running the publisher with

```
RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ros2 run ros2_issues publisher_node --ros-args -p num_elements:=5000 -p rate:=1000
```

produces this output:

```
634242922.738549 [0] publisher_: using network interface wlo1 (udp/192.168.1.234) selected arbitrarily from: wlo1, virbr0, virbr1, docker0
[INFO] [1634242923.747148904] [test_publisher]: pub rate:   369.34
[INFO] [1634242924.749220049] [test_publisher]: pub rate:   373.22
...
```

So not even the publishing is full speed, without any subscriber to the topic. I see the publisher running at 100 %CPU, so something is really heavy weight about publishing. Worse, running ``rostopic hz`` shows a rate of about 30 msg/s.

```
RMW_IMPLEMENTATION=rmw_cyclonedds_cpp ros2 topic hz -w 100 /test_publisher/array
1634243031.116776 [0] ros2: using network interface wlo1 (udp/192.168.1.234) selected arbitrarily from: wlo1, virbr0, virbr1, docker0
average rate: 31.275
  min: 0.030s max: 0.041s std dev: 0.00314s window: 33
average rate: 30.502
  min: 0.030s max: 0.044s std dev: 0.00351s window: 63
```

This is what I get from rostopic bw. The size of the message (about 80kb) agrees with what I computed by hand:

```
5.20 MB/s from 100 messages
  Message size mean: 0.08 MB min: 0.08 MB max: 0.08 MB
```

I tried ``sudo sysctl -w net.core.rmem_max=8388608 net.core.rmem_default=8388608`` and also was able to restrict the interface to loopback (lo) but no improvement.

FastRTPS is a bit better, at least here I can send messages with up to 50,000 elements before it falls off at 110,000 messages:

```
RMW_IMPLEMENTATION=rmw_fastrtps_cpp ros2 run ros2_issues publisher_node_complex --ros-args -p num_elements:=110000 -p rate:=1000
[INFO] [1634243372.579736663] [test_publisher]: pub rate:   404.43
[INFO] [1634243373.581322637] [test_publisher]: pub rate:   391.37
[INFO] [1634243374.583210605] [test_publisher]: pub rate:   414.22
```

But if I send messages of size 5,000, ``rostopic hz`` also shows about 30Hz, similar to rmw_cyclonedds_cpp.

#### Additional information

This is a show stopper for porting e.g. a driver for an event based camera from ROS1 to ROS2, see here: https://github.com/berndpfrommer/metavision_ros_driver. The hardware is a 8-core AMD Ryzen laptop, less than 1 year old, so definitely not a slow machine, and this is all running on-machine, no network traffic. To run the above code it is fastest to clone the [very small repo](https://github.com/berndpfrommer/ros2_issues) linked above.
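
To get a feel for why arrays of non-primitive elements are expensive on the Python side, here is a rough stand-in written with the standard library only. It is not the actual CDR or rclpy conversion code. The 16-byte layout is an assumption (the fields of TestElement padded to a 4-byte boundary), chosen because it reproduces the roughly 80 kB message size reported in the issue for 5,000 elements. The class and function names are made up.

```python
import struct
from dataclasses import dataclass

# Assumed wire layout for one TestElement:
# uint16 x, uint16 y, int32 sec, uint32 nanosec, bool polarity, 3 padding bytes.
ELEMENT = struct.Struct("<HHiI?3x")


@dataclass
class Stamp:
    sec: int
    nanosec: int


@dataclass
class Element:
    x: int
    y: int
    ts: Stamp
    polarity: bool


def encode(elements):
    """Pack a list of Element objects into one contiguous byte buffer."""
    return b"".join(
        ELEMENT.pack(e.x, e.y, e.ts.sec, e.ts.nanosec, e.polarity) for e in elements
    )


def decode(buf):
    """Rebuild Python objects from the buffer: two allocations per element."""
    return [
        Element(x, y, Stamp(sec, nanosec), polarity)
        for x, y, sec, nanosec, polarity in ELEMENT.iter_unpack(buf)
    ]


if __name__ == "__main__":
    elements = [Element(i, i + 1, Stamp(i, 2 * i), i % 2 == 0) for i in range(5000)]
    payload = encode(elements)
    print(f"{len(payload)} bytes for {len(elements)} elements")
    print(f"{2 * len(decode(payload))} Python objects created per message")
```

Under these assumptions, a single 5,000-element message turns into 10,000 small Python objects on the receiving side, so at the 1000 msgs/sec targeted in the issue a Python subscriber would have to create on the order of ten million objects per second.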

Marco · November 25, 2022, 7:24am

Hi everyone.
After some more investigation I found that it really is the Gazebo simulation that causes the peaking CPU usage in Python nodes.
As @haudren pointed out, this really seems to be caused by Python nodes listening to the sim_time coming from Gazebo.
In our setup in particular, we had set the Gazebo parameters to a publish_rate of 1000Hz. This is mainly because some C++ nodes work at high frequency, and we need the clock resolution to be on the order of milliseconds.
The C++ nodes do not seem to suffer noticeably from this high-frequency messaging, while the Python nodes really go crazy.

I tested this with an empty idle action server node and an empty gazebo world, and these are the results using different publish rates for gazebo:
publish_rate → action server node CPU load
1000Hz → ~50%
100Hz → ~12%
10Hz → ~2%
1Hz → 0%
(tested on a laptop with an intel i7 2.60GHz CPU)
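
As a back-of-the-envelope reading of the table above, the load can be converted into CPU time per clock message. This is only a sketch: the figures are the approximate ones from the table, the function names are made up, and it assumes that all of the load is per-message work.

```python
# Approximate figures from the table: publish_rate in Hz -> CPU load in percent of one core.
MEASURED = {1000: 50.0, 100: 12.0, 10: 2.0, 1: 0.0}


def cpu_ms_per_message(rate_hz: float, cpu_percent: float) -> float:
    """CPU milliseconds spent per received message, assuming all load is per-message work."""
    cpu_seconds_per_second = cpu_percent / 100.0
    return cpu_seconds_per_second / rate_hz * 1000.0


def per_message_costs(measured):
    """Map each publish rate to its per-message CPU cost in milliseconds."""
    return {
        rate: round(cpu_ms_per_message(rate, load), 3)
        for rate, load in measured.items()
    }


if __name__ == "__main__":
    for rate, cost in per_message_costs(MEASURED).items():
        print(f"{rate:>5} Hz: {cost} ms of CPU per message")
```

With these numbers the cost works out to roughly 0.5 ms at 1000Hz, 1.2 ms at 100Hz and 2 ms at 10Hz, so each clock message costs the Python node somewhere between half a millisecond and a couple of milliseconds of CPU time. The load therefore grows with the publish rate, though not strictly in proportion to it, which is consistent with per-message handling being the dominant cost.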

This is really a terrible handling of the simulation time. I find it crazy that the deserialization of something so basic as a Time message would really take this much CPU…

Does anyone have any suggestions for dealing with this situation?
Our C++ nodes still need a high clock update rate. We could of course port all our python nodes to C++, but it is normally very convenient to have some high level stuff in python for quick development iterations…

Marco · November 25, 2022, 8:46am

If anyone would like to try to reproduce this, here is the code for the minimal action server (it uses a nav2 action, just because I did not see any standard ros2 actions yet):


import rclpy
from rclpy.node import Node
from rclpy.action import ActionServer
from nav2_msgs.action import Wait


class MinimalActionServerNode(Node):
    def __init__(self):
        super().__init__(
            node_name="MinimalActionServerNode",
        )
        self._minimal_action_server = ActionServer(
            self,
            Wait,
            "random_action_name",
            execute_callback=self.exec_callback,
        )
        self.get_logger().info("Initialized!")

    def exec_callback(self, goal_handle_):
        # Mark the goal as succeeded; otherwise rclpy treats it as aborted.
        goal_handle_.succeed()
        return Wait.Result()


def main(args=None):
    rclpy.init(args=args)
    node = MinimalActionServerNode()

    try:
        rclpy.spin(node)
    except KeyboardInterrupt:
        node.get_logger().info("MinimalActionServerNode interrupted via keyboard")
    rclpy.shutdown()


if __name__ == "__main__":
    main()
