High CPU load for simple Python nodes

Yes, to be more specific: the behavior I have seen is that Python nodes that subscribe or publish to topics at a high rate (e.g. ~30 Hz) end up capping a core at 100%, whereas the same workload in a C++ node is negligible.

I have looked a little into the code that handles message parsing for Python, and it seems quite inefficient to me. For each message received, the C library:

  1. Imports the correct Python module
  2. Looks up the appropriate Python class
  3. Instantiates a python message object
  4. Converts the C fields into Python objects
  5. Assigns the C values to each Python field
  6. Destroys all those temporaries
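
The per-message pattern above can be sketched in pure Python. This is only an illustration, using `collections.OrderedDict` as a stand-in for the generated message module and class; it is not the actual rclpy code path:

```python
# Illustrative timing of the per-message pattern described above.
# 'collections'/'OrderedDict' stand in for the generated message
# module/class; this is NOT the real rclpy internals.
import importlib
import time

N = 100_000
start = time.perf_counter()
for _ in range(N):
    mod = importlib.import_module("collections")   # 1. import the module
    cls = getattr(mod, "OrderedDict")              # 2. look up the class
    obj = cls()                                    # 3. instantiate an object
    obj["data"] = 10                               # 4./5. convert and assign a field
    del obj                                        # 6. destroy the temporary
elapsed = time.perf_counter() - start
print(f"per message: {elapsed / N * 1e6:.3f} us")
```

Even though `sys.modules` caching makes the repeated import fairly cheap, doing all six steps for every single message adds up at high publish rates.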

It seems like it would be much more efficient to bake the struct definition into the C Python module, do all the struct building in the C library, and return only a single Python object. For example, here is a little C benchmark to exercise the deserialization of a simple char message:

#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <_char_s.c>

const int num_parse = 1000000;

int main(int argc, char *argv[]) {
	Py_Initialize();
	std_msgs__msg__Char value;
	value.data = 10;

	struct timespec start, end;
	clock_gettime(CLOCK_MONOTONIC_RAW, &start);

	PyObject *py_message = std_msgs__msg__char__convert_to_py(&value);
	printf("%p\n", py_message);
	for (int i = 0; i < num_parse; ++i) {
		py_message = std_msgs__msg__char__convert_to_py(&value);
	}

	clock_gettime(CLOCK_MONOTONIC_RAW, &end);

	uint64_t delta_us = (end.tv_sec - start.tv_sec) * 1000000 +
			    (end.tv_nsec - start.tv_nsec) / 1000;

	printf("Total time in millisecs: %lu\n", delta_us);
	printf("Per parse: %f ms\n", (double)delta_us / num_parse);
	return 0;
}

The results:

Total time in millisecs: 2540432
Per parse: 2.540432 ms

So ~2.5 ms to parse a single integer. And this is only the deserialization itself; there is probably additional overhead in the executor, the trampoline to execute the callback, etc.

(To compile it:

gcc test_deserialization.c -I/opt/ros/foxy/lib/python3.8/site-packages/std_msgs/msg -I/usr/include/python3.8 -I/opt/ros/foxy/include -lpython3.8

)

6 Likes

Not adding too much to the discussion here, but I consider 30 Hz to be a relatively low rate in robotics. I would expect to use this kind of rate in some snappy controllers for navigation on wheeled robots.

I think this issue is huge. In ROS 1 you could say Python and C++ were an equal choice (minus nodelets and some libraries), while from the discussion here it seems that's no longer the case for ROS 2.

4 Likes

What is the technical explanation for this?

According to my benchmark above, we spend on average ~1 ms per message just looking up the correct message class in the dynamically loaded Python module. I think this is low-hanging fruit: we should import the message module and message class only once, when the Python message modules are first imported. I will try to submit a PR for this.

I think the second source of inefficiency is that instead of having a Python wrapper over a C struct, the code instantiates a Python object in the C code and performs conversions (regular long to PyLong, and so on) for every field! Fixing this, however, would require a much larger refactor of the way messages are handled in Python. Not something I'm opposed to, but it would take more work.
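
To illustrate the "Python wrapper over a C struct" alternative, here is a hedged ctypes sketch (`CharMsg` is a made-up type, not the rclpy message): the field is read from the underlying C memory on access, rather than every field being eagerly converted to a Python object up front.

```python
# Illustration of a Python view over C struct memory using ctypes.
# 'CharMsg' is hypothetical; real messages would be generated types.
import ctypes

class CharMsg(ctypes.Structure):
    _fields_ = [("data", ctypes.c_uint8)]  # mirrors a single-byte char message

# Pretend this byte buffer is the C-side message memory.
buf = bytes([10])
msg = CharMsg.from_buffer_copy(buf)

# The conversion to a Python int happens lazily, on attribute access.
print(msg.data)  # → 10
```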

It could also be nice to have the opinion of some of the Python maintainers on how / why this design came to be :thinking:

6 Likes

I have also encountered this issue. It can be easily re-created by running the demo_nodes_py listener example:

  1. Run ros2 run demo_nodes_py listener
  2. Publish a message to the /chatter topic at varying frequencies
  3. Compare CPU usage to ros2 run demo_nodes_cpp listener

When running this on an ARM platform with a publish rate of 100 Hz, the Python listener node used over 50% of a single core, while the C++ listener used under 10%. This difference surprises me for such a simple node.

Running the demo_nodes_py talker and demo_nodes_cpp talker nodes yields similar results. However, running a Python proto publisher (ROS-agnostic) at 100 Hz yields CPU usage similar to the demo_nodes_cpp talker node at 100 Hz.

5 Likes


Hi, I just stumbled upon this thread and noticed a pretty big mistake in your calculations. What you are actually measuring and printing are not milliseconds but microseconds. Therefore the time per parse is in the order of 2.5μs which is three orders of magnitude less than you stated and should be fast enough to not actually be a problem.
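
For the record, redoing the arithmetic from the benchmark output above:

```python
# The benchmark computed microseconds but labeled them milliseconds:
# delta = seconds * 1e6 + nanoseconds / 1e3 is a value in microseconds.
total_us = 2_540_432      # measured total, actually in microseconds
num_parse = 1_000_000
per_parse_us = total_us / num_parse
print(per_parse_us)  # → 2.540432 (microseconds per parse, not ms)
```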

4 Likes

Good catch @timonegk! I actually tinkered with the rclpy code a bit, and the deserialization of simple messages is quite fast. I found some other issues with the way the executors work, but it's a bit hard to disentangle everything… I hope I can find the time to submit a proper PR!

1 Like

@Cam_Buscaron said I should chime in to suggest trying more cores 😁 That is what Cam is doing with Ampere Altra 32- to 128-core Arm CPUs. More powerful than a Xeon, more efficient than an RPi4.

3 Likes

I also just bumped into this issue, in a node that receives multiple messages and provides them via Socket.IO to a webapp. This node has no requirement to react instantly, so I programmed a custom executor that limits message processing to 10 Hz. The result was a drop in CPU usage from 80% to 8%.
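
The rate-limiting idea can be sketched generically. This is a hedged illustration: `spin_once` stands in for whatever drains the queued callbacks, and a real implementation would use the rclpy executor API.

```python
# Sketch of a rate-limited processing loop: drain pending messages at a
# fixed tick instead of waking up for every single incoming message.
import time

RATE_HZ = 10  # process queued messages at most 10 times per second

def run(spin_once, should_stop):
    period = 1.0 / RATE_HZ
    while not should_stop():
        t0 = time.monotonic()
        spin_once()                    # process everything queued so far
        dt = time.monotonic() - t0
        if dt < period:
            time.sleep(period - dt)    # sleep instead of busy-spinning
```

The sleep between ticks is what lets the process idle instead of spinning, which is consistent with the large CPU drop described above.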

Perhaps this information is useful to someone else…

Hi @joespeed, we approve:

(image: cpu-stress-test)

3 Likes