How to use callback groups in ROS2

Hello ROS developers,

Some time ago, we at Karelics migrated our robot software from ROS1 to ROS2 and needed to learn quite a few new things in the process. As it usual happens when learning something new, the migration process also did not go without hiccups. In particular, there were a few ROS2 concepts that we had our share of struggles with at the beginning. We now wish to share some of our experiences and learned lessons with the ROS community to both spark up discussions about best practices etc. and also to hopefully help others who may be facing similar issues.

This post discusses our team’s experiences and findings about callback groups and how to correctly utilize them to prevent deadlocks and other unwanted surprises in your software. We hope that this info will be useful particularly to newer developers learning ROS2. You can also find our original blog post along with other developer stories on our website.

While the discussion and examples below only reference rclpy, we make the following remark about the C++ side or ROS2: While we have not verified the examples presented in this post with rclcpp, by our knowledge, our main points also hold there.

Without further ado, let’s get going.

Basics of executors and callback groups

For the sake of completeness, let us start with some basic definitions and concepts related to executors and callback groups in rclpy. Feel free to skip this section or parts of it if you are already familiar with the topics.

Executors in rclpy

By default, rclpy offers two different executors for the user to choose from:

  • SingleThreadedExecutor (default)
  • MultiThreadedExecutor

SingleThreadedExecutor is quite straightforward: It executes callbacks in a single thread, one at a time, and thus the previous callback must always finish before a new one can begin execution.

MultiThreadedExecutor, on the other hand, is capable of executing several callbacks simultaneously. While the presence of GIL in Python means that it is still not able to utilize truly concurrent execution with multiple CPU cores, it does provide the often useful possibility of having different callback executions overlap with each other and thus also allows waiting in a callback, something that SingleThreadedExecutor simply cannot achieve. Here, one should note the wording “is capable of,” hinting that the parallel execution is not a given (more on that in the following sections).

SingleThreadedExecutor doesn’t really care about any callback group options. Thus we will not discuss it further, and the following topics only concern the MultiThreadedExecutor.

Callback groups for MultiThreadedExecutors

Rclpy offers two different callback groups to work with in combination with MultiThreadedExecutors:

  • MutuallyExclusiveCallbackGroup
  • ReentrantCallbackGroup

MutuallyExclusiveCallbackGroup allows the executor to execute only one of its callbacks simultaneously, essentially making it as if the callbacks in the group were executed by a SingleThreadedExecutor. Thus, it is a good choice to put any callbacks accessing critical and potentially non-thread-safe resources in the same MutuallyExclusiveCallbackGroup.

ReentrantCallbackGroup allows the executor to schedule and execute the group’s callbacks in any way the executor sees fit, without restrictions. This means that, in addition to different callbacks being run simultaneously, the executor can also execute different instances of the same callback simultaneously. An example case of this would be if the execution of a timer callback takes longer than the timer’s firing period (although that particular case should be avoided at all costs in rclpy, but that’s another story).

Using callback groups to control execution and avoid deadlocks

Before we continue, it is important to understand a couple things about callbacks from rclpy’s perspective. In the context of rclpy and executors, a callback means a function whose scheduling and execution is handled by an executor. Some examples of callbacks in this context are

  • subscription callbacks (receiving and handling data from a topic),
  • timer callbacks,
  • service callbacks (for executing service requests in a server),
  • different callbacks in action servers and clients,
  • done-callbacks of Futures.

Below are a couple important points about callbacks that should be kept in mind when working with callback groups.

  • Almost everything in ROS 2 is a callback! Every function that is run by an executor is, by definition, a callback. The non-callback functions in a ROS 2 system are found mainly at the edge of the system (user and sensor inputs etc).
  • Sometimes the callbacks are hidden and their presence may not be obvious from the user/developer API provided in rclpy. This is the case especially with any kind of “synchronous” call to a service or an action. For example, the synchronous call Client.call(request) to a service adds a Future’s done-callback that needs to be executed during the execution of the function call, but this callback is not directly visible to the user.

Controlling execution with callback groups

In order to control execution with callback groups, one can consider the following guidelines (by no means an all-inclusive list!).

  • Register callbacks accessing critical non-thread-safe resources in the same MutuallyExclusiveCallbackGroup (or protect the resources by locks manually).
  • If you have a callback whose execution instances need to be able to overlap with each other, register it to a ReentrantCallbackGroup.
  • If you have callbacks that require to be potentially executed in parallel to one another, register them to
  • a ReentrantCallbackGroup, or
  • different MutuallyExclusiveCallbackGroups (this option is good if you want the callbacks to not overlap themselves, or also need or want thread safety with respect to some other callbacks).
  • Note that the last point in the list is a valid way of allowing parallel execution for different callbacks, and can even be more desirable than simply registering everything into one ReentrantCallbackGroup.

Avoiding deadlocks

Every ROS(2) developer knows that making synchronous calls to services or actions in a callback is bad and can lead to deadlocks… Except that this is not quite the whole truth. While using asynchronous calls (and consequently explicitly registering done-callbacks to futures) is indeed safer and allows things to work as intended even with SingleThreadedExecutors, synchronous calls can also be made to work in callbacks as long as the node’s callback group setup is done correctly. The obvious benefit of synchronous calls is that they make the code look cleaner and easier to understand, so let us see how to make them work without risk of deadlocks.

First thing to note here is that every node’s default callback group is a MutuallyExclusiveCallbackGroup. If the user does not specify any other callback group when creating a timer, subscription, client etc., any callbacks created then or later by these entities will use the node’s default callback group. Furthermore, if everything in a node uses the same MutuallyExclusiveCallbackGroup, that node essentially acts as if it was handled by a SingleThreadedExecutor, even if a multi-threaded one is specified! Thus, whenever one decides to use a MultiThreadedExecutor, some callback groups should always be specified in order for the executor choice to make sense.

With the above in mind, here are a couple guidelines to help avoid deadlocks:

  • Use asynchronous calls (if you always use only these everywhere, there should never be a deadlock).
  • If you make a synchronous call in any type of a callback, this callback and the client making the call need to belong to
    • different callback groups (of any type), or
    • a ReentrantCallbackGroup.

Failing the latter point will always cause a deadlock. An example of such a case would be making a synchronous service call in a timer callback (see the next section).

Examples

Let us look at some simple but hopefully enlightening examples about different callback group setups. The following demo code considers calling a service synchronously in a timer callback.

You can find a Github link to the full script here. Feel free to check it out, try it yourself and expand it for further experimentation. You could, for example, add a publisher to the service node (perhaps publish when the service is called), subscribe to it in the client node, maybe add some sleeps and see what happens with different callback groups. As a curiosity, try making the subscriber never receive anything even though the service and timer are working.

Let us first go through the main parts of the code. We have a node providing a simple mock service:

class ServiceNode(Node):
    def __init__(self):
        super().__init__('mock_service_node')
        self.srv = self.create_service(Empty, 'test_service', callback=self.service_callback)

    def service_callback(self, request, result):
        self.get_logger().info('Server received request')
        return result

and a node with a client to the above service, along with a timer (unless we specify our desire for more direct control):

class CallbackGroupDemo(Node):
    def __init__(self, client_cb_croup, timer_cb_group, manual_calls):
        super().__init__('callback_group_demo_node')
        self.client = self.create_client(Empty, 'test_service', callback_group=client_cb_croup)
        if not manual_calls:
            self.call_timer = self.create_timer(1, self.timer_cb, callback_group=timer_cb_group)

    def call_srv(self, delay: float = 1):
        sleep(delay)
        self._call_srv()

    def _call_srv(self):
        self.get_logger().info('Client sending request')
        _ = self.client.call(Empty.Request())
        self.get_logger().info('Client received response')

    def timer_cb(self):
        self._call_srv()

We then have a couple helper functions for spinning the service node and making “manual” calls from outside of the executor-realm:

def spin_srv(executor):
    try:
        executor.spin()
    except rclpy.executors.ExternalShutdownException:
        pass

def call_srv_manually(client_node):
    client_node.call_srv()
    client_node.get_logger().info('Test finished successfully.\n')

And finally the actual demo script, taking the setup arguments: callback groups for the client and timer and whether we want to send service calls manually instead of using the timer. Here, we create the above nodes, assign them to executors (MultiThreadedExecutor for the demo node) and make them spin. Finally, we clean up.

def run_test(client_cb_group, timer_cb_group, manual_call):
    rclpy.init()

    node = CallbackGroupDemo(client_cb_croup=client_cb_group, timer_cb_group=timer_cb_group, manual_calls=manual_call)
    executor = MultiThreadedExecutor()
    executor.add_node(node)

    service_node = ServiceNode()
    srv_executor = SingleThreadedExecutor()
    srv_executor.add_node(service_node)
    srv_thread = Thread(target=spin_srv, args=(srv_executor, ), daemon=True)
    srv_thread.start()

    call_thread = Thread(target=call_srv_manually, args=(node, ), daemon=True)
    if manual_call:
        call_thread.start()

    try:
        print("")
        node.get_logger().info('Beginning demo, end with CTRL-C')
        executor.spin()
    except KeyboardInterrupt:
        node.get_logger().info('KeyboardInterrupt, shutting down.\n')
    node.destroy_node()
    service_node.destroy_node()
    rclpy.shutdown()
    try:
        srv_thread.join()
    except KeyboardInterrupt:
        pass    
    if manual_call: 
        call_thread.join()

Note that setting either of the callback group options to None will cause the respective callback(s) to be assigned into the node’s default MutuallyExclusiveCallbackGroup.

Let us then look at what happens when running the test with different options. We start by just using the default callback group for everything. We see that if we call the service manually (not from a callback), everything works: run_test(client_cb_group=None, timer_cb_group=None, manual_call=True) outputs

[INFO] [1649233901.626448480] [callback_group_demo_node]: Beginning demo, end with CTRL-C
[INFO] [1649233902.622004759] [callback_group_demo_node]: Client sending request
[INFO] [1649233902.624205200] [mock_service_node]: Server received request
[INFO] [1649233902.627408435] [callback_group_demo_node]: Client received response
[INFO] [1649233902.627800491] [callback_group_demo_node]: Test finished successfully.
^C[INFO] [1649233905.490485451] [callback_group_demo_node]: KeyboardInterrupt, shutting down.

This is because there is only one callback running on the client side: the (hidden) done-callback of the get_result_future, so it cannot be blocked by anything. But, the situation changes if we try using the timer for the service calls: run_test(client_cb_group=None, timer_cb_group=None, manual_call=False) outputs

[INFO] [1649233905.498086234] [callback_group_demo_node]: Beginning demo, end with CTRL-C
[INFO] [1649233906.497072443] [callback_group_demo_node]: Client sending request
[INFO] [1649233906.497673382] [mock_service_node]: Server received request ← DEADLOCK
^C[INFO] [1649233909.797191396] [callback_group_demo_node]: KeyboardInterrupt, shutting down.

The deadlock is caused by the timer callback and the above-mentioned done-callback being in the same (node’s default) MutuallyExclusiveCallbackGroup. Because the timer callback blocks execution until the result of the service call is received, the done-callback never gets to execute, and consequently the service call never completes.

So how do we solve the above issue? If we insist on using the synchronous call, we have two options: separate the timer and service callbacks to different callback groups (of any type) or put them into one ReentrantCallbackGroup. In the former case, we can replace either or both of the groups by ones set by ourselves. Thus all of the following test cases give the same output:

run_test(client_cb_group=MutuallyExclusiveCallbackGroup(), timer_cb_group=None, manual_call=False) 

run_test(client_cb_group=None, timer_cb_group=MutuallyExclusiveCallbackGroup(), manual_call=False)  

group1 = MutuallyExclusiveCallbackGroup()
group2 = MutuallyExclusiveCallbackGroup()
run_test(client_cb_group=group1, timer_cb_group=group2, manual_call=False)

cb_group = ReentrantCallbackGroup()

run_test(client_cb_group=cb_group, timer_cb_group=cb_group, manual_call=False)

Output:

[INFO] [1649233909.812419539] [callback_group_demo_node]: Beginning demo, end with CTRL-C
[INFO] [1649233910.812076113] [callback_group_demo_node]: Client sending request
[INFO] [1649233910.812990099] [mock_service_node]: Server received request
[INFO] [1649233910.815389239] [callback_group_demo_node]: Client received response
[INFO] [1649233911.811726650] [callback_group_demo_node]: Client sending request
[INFO] [1649233911.812350167] [mock_service_node]: Server received request
[INFO] [1649233911.813846297] [callback_group_demo_node]: Client received response
^C[INFO] [1649233912.312435736] [callback_group_demo_node]: KeyboardInterrupt, shutting down.

As you can see, the timer also now keeps firing repeatedly (as expected), whereas previously the first execution got stuck and blocked any further executions of the timer callback.

The last demo case is to see what happens if we replace both callback groups with the same MutuallyExclusiveCallbackGroup (one that’s different from the node’s default one). The result is the same as when using the default group:

cb_group = MutuallyExclusiveCallbackGroup()
run_test(client_cb_group=cb_group, timer_cb_group=cb_group, manual_call=False)

outputs

[INFO] [1649233905.498086234] [callback_group_demo_node]: Beginning demo, end with CTRL-C
[INFO] [1649233906.497072443] [callback_group_demo_node]: Client sending request
[INFO] [1649233906.497673382] [mock_service_node]: Server received request ← DEADLOCK
^C[INFO] [1649233909.797191396] [callback_group_demo_node]: KeyboardInterrupt, shutting down.

The reason why this didn’t work is that the future’s done-callback - which is the critical callback getting blocked by the timer callback - is assigned to the future by the service client. Thus, that callback will use the same callback group as the client does. If it would use the node’s default one instead, things would work, at least in this particular case.

When we used two different MutuallyExclusiveCallbackGroups, it worked because the timer callback (the one being blocked by the service call) was in a different group than the client (who relayed its callback group to the Future’s done-callback). Thus, the done-callback was able to execute parallel to the timer callback, the client returned the result of the service call, and the timer callback was able to finish (and execute again the next time the timer fired)!

Epilogue

We hope that this post will be helpful to anyone struggling to find a correct callback group setup to make their ROS 2 system work efficiently and without hiccups. In particular we wish for deadlock-free development for the whole of the ROS 2 community going forward! Feel free to also share your thoughts on the topic, so we can all learn from each other.

24 Likes

@tanelikor Thanks so much for this! I think this is extremely valuable, and would be a great addition to our overall documentation at ROS 2 Documentation — ROS 2 Documentation: Rolling documentation . Would you be interested in making a PR to GitHub - ros2/ros2_documentation: ROS 2 docs repository? That way it will be a permanent part of the documentation.

5 Likes

I often manage to understand some undocumented concepts by looking at the source code but callback groups was not one of those, thank you for the explanations!

1 Like

Thanks for sharing! +1 to adding this to the documentation.

@tanelikor Very nice write-up. Thank you! I second @clalancette’s proposal to add this to the ROS 2 Documentation. The Concept → Executors page contains a short paragraph on callback groups, but callback groups are definitely worth their own page or a joint page ‘Executors and Callback Groups’. Also, the current Executor page is missing rclpy. If you decide for a PR to ros2_documentation, I’m happy to review it.

1 Like

Thanks for the kind feedback! Sure, I can turn this into a doc page and make a PR for it. This might take a bit of time as I have a couple busy days ahead, but I’ll get to it soon.

I just had the same feeling in this particular case. I was really used to Callbackgroups in rclcpp but after a few days of trying with rclpy and reading the source code, came to the conclusion that Callbackgroups and particularly adding them during the runtime of an existing node isn‘t an implemented feature in rclpy.

So thank you very much @tanelikor for showing me that this is a feature I can also use in rclpy. Now I don‘t have an excuse to stop my students from using Python… :smiley: :+1: