Sadly, rclpy is slow. This is not only a Python problem, but also a problem of the underlying rclpy implementation. (Hint: we are currently developing a Python-based events executor, but that is a story for another day.)
TF2 is used in many nodes. The number of callbacks triggered by TF2 can be very high, especially if there are multiple sources of TF data and you operate at a high frequency. A simple ROS 2 rclpy node can easily max out a CPU core just by processing TF data. This is especially unfortunate if TF is used in only a few places in the code, e.g. for low-frequency navigation or behavior operations.
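For context, this is the standard rclpy pattern that causes the load (plain tf2_ros API; a sketch for illustration, not code from the package):

```python
# Standard rclpy TF setup — the /tf callbacks run in Python, which is
# where the CPU time goes at high TF rates.
import rclpy
from rclpy.node import Node
from rclpy.time import Time
from tf2_ros import TransformException
from tf2_ros.buffer import Buffer
from tf2_ros.transform_listener import TransformListener


class PlainTfNode(Node):
    def __init__(self):
        super().__init__('plain_tf_node')
        # The listener subscribes to /tf and /tf_static; every incoming
        # message triggers a Python callback that fills the buffer.
        self.buffer = Buffer()
        self.listener = TransformListener(self.buffer, self)
        self.timer = self.create_timer(1.0, self.on_timer)

    def on_timer(self):
        # The occasional lookup is cheap; the constant stream of /tf
        # callbacks feeding the buffer is the expensive part.
        try:
            t = self.buffer.lookup_transform('map', 'base_link', Time())
            self.get_logger().info(f'base_link in map: {t.transform.translation}')
        except TransformException as e:
            self.get_logger().warn(str(e))
```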
This package aims to solve this problem by moving the TF buffer and listener into a C++ node that shares the process with the rclpy node. That way, the TF callbacks are processed in C++, and the Python node only queries the buffer for transforms when needed, via a simple and performant pybind11 interface.
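Usage is meant to be as close to a drop-in as possible. The sketch below is illustrative only: `cpp_tf_buffer` is a placeholder module name, not the package's actual import path; check the repository for the real one.

```python
# Illustrative drop-in sketch — only the buffer construction changes,
# the familiar tf2_ros-style query API stays the same.
import rclpy
from rclpy.node import Node
from rclpy.time import Time
from cpp_tf_buffer import Buffer  # hypothetical pybind11-backed drop-in


class DropInTfNode(Node):
    def __init__(self):
        super().__init__('drop_in_tf_node')
        # No Python-side TransformListener: the C++ node sharing this
        # process subscribes to /tf and /tf_static and fills the buffer.
        self.buffer = Buffer(self)

    def lookup(self):
        # Same lookup call, but it crosses into C++ via pybind11 instead
        # of reading a Python-maintained buffer.
        return self.buffer.lookup_transform('map', 'base_link', Time())
```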
While spinning up an additional node is not ideal, its overhead is negligible compared to the performance gain. I don’t have rigorous benchmarks, but it brought the CPU utilization of many nodes in our stack down from 100% to ~20%.
In addition, this solution reduces the number of executor deadlock scenarios and, in many cases, allows the rclpy node to run on a single-threaded executor instead of a multi-threaded one, resulting in further performance gains.
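Concretely, since the Python executor no longer has to service a flood of /tf callbacks alongside your own, spinning can often be reduced to the default single-threaded case (a sketch, reusing the `DropInTfNode` from above):

```python
# With TF handled by the in-process C++ node, no MultiThreadedExecutor or
# callback-group juggling is needed just to keep TF serviced.
import rclpy

rclpy.init()
node = DropInTfNode()
rclpy.spin(node)  # rclpy’s default single-threaded executor
rclpy.shutdown()
```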
We have been using it in-house for a few months now, and it runs very smoothly.
Interesting! We worked around this problem by launching a central BufferServer and using services to query transforms. We made a PR here, but it got stuck, seemingly due to lack of interest.
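For readers who haven’t seen it: tf2_ros already ships an action-based BufferServer/BufferClient pair following this central-buffer pattern (the service-based querying we used is what the PR proposes). A minimal, hedged sketch of a client-side query, where the server namespace is an assumption that must match the running buffer server:

```python
# Querying a central buffer server with the action-based BufferClient
# that ships with tf2_ros.
import threading

import rclpy
from rclpy.duration import Duration
from rclpy.node import Node
from rclpy.time import Time
from tf2_ros.buffer_client import BufferClient

rclpy.init()
node = Node('tf_query_example')
# Spin in a background thread so the action client’s callbacks are
# serviced while lookup_transform() blocks.
threading.Thread(target=rclpy.spin, args=(node,), daemon=True).start()

client = BufferClient(node, 'tf2_buffer_server')  # assumed namespace
t = client.lookup_transform('map', 'base_link', Time(),
                            timeout=Duration(seconds=1.0))
node.get_logger().info(f'base_link in map: {t.transform.translation}')
```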
Great idea! As inspiration, I’ll also link our (still ROS 1-only) tf2_server, which is based on the idea that you can configure multiple “TF streams” defined by the subtrees they should contain; downstream nodes then just subscribe to these substreams. The greatest savings are in high-level nodes like mapping, which do not need to receive e.g. joint positions because they only care about the subtree above base_link.
I am also familiar with the BufferServer, which is cool, but the use case differs slightly: ours is more of a drop-in replacement, so you don’t need to think much about launching another shared component or about the executor (service calls in callbacks can be a bit tricky sometimes). Still, for low-volume queries from many nodes yours is probably better, while ours probably has lower latency / cost per query. But this is just a guess, so I could be wrong.