What is the expected behavior of rclcpp in case of an exception raised in a user callback?

Reposted from Exception handling in user callbacks? by meyerj · Pull Request #2017 · ros2/rclcpp · GitHub

I tried to find some information about this topic in the documentation, in the code, on GitHub, on Discourse, on ROS Answers, but failed to find something conclusive, or maybe used the wrong search terms. Only this post and this answer seem to be related. For the special case of service callbacks I remember having seen a discussion/feature request to forward exceptions to the caller as a special response like in ROS 1, but did not find it anymore now.

  1. User callbacks must never throw?

    They do. I triggered the case by using the ros1_bridge with a service server in ROS 1 and a client calling it from ROS 2: If the ROS 1 service is not available anymore, for example because the ROS 1 node died, the callback defined in ServiceFactory<ROS1_T, ROS2_T>::forward_2_to_1() throws a runtime error after the roscpp service call API returned false. Also any ROS 2 middleware can throw exceptions, I assume, when the user callback invokes a publisher or service client itself. Apparently it is even recommended to handle errors by throwing exceptions.

    So if the rule were that user callbacks must handle exceptions internally, I guess ros1_bridge and numerous other node implementations would need to be fixed.

  2. Did I miss a place where this is already handled within rclcpp?

    Even rclcpp code itself may throw exceptions in the Executor code path while spinning, for example here.

    If that is not the case yet, maybe a per-executor, per-node or per-context flag would be nice to have that decides whether exceptions are left unhandled, as seems to be the case now, or whether rclcpp catches and logs them internally. Or some mechanism to register a user callback that receives an std::exception_ptr and whose return value decides whether the executor continues or aborts…

  3. Always catch exceptions when spinning?

    As a last resort, I wanted to patch the main loop of the dynamic_bridge (and other nodes), such that exceptions get logged, but the node does not terminate and continues to forward other topics and service calls. But that is not possible without the patch proposed here:

    // ROS 2 spinning loop
    rclcpp::executors::SingleThreadedExecutor executor;
    while (ros1_node.ok() && rclcpp::ok()) {
      try {
        executor.spin_node_once(ros2_node);
      } catch (const std::exception & e) {
        // Log the exception and continue spinning...
        RCLCPP_ERROR(ros2_node->get_logger(), "Exception while spinning: %s", e.what());
      }
    }
    

    The problem is that it triggers the “Node has already been added to an executor” exception here in the next cycle after the exception, and hence keeps logging in a loop. So maybe the executor needs to be recreated to recover? Or I could call executor.remove_node(ros2_node) in the catch body as a workaround? That was the point where I started to investigate the problem and ended up here.
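    The failure mode can be illustrated outside of rclcpp with a minimal mock of the executor's add/remove bookkeeping. This is only a sketch of the pattern I am describing, not rclcpp code: a throwing callback skips the cleanup step, so the next cycle fails on re-adding the node.

    ```cpp
    #include <iostream>
    #include <set>
    #include <stdexcept>
    #include <string>

    // Minimal mock of an executor's add/remove bookkeeping (NOT rclcpp code).
    class MockExecutor {
    public:
      // Mimics the spin_node_once() shape: add the node, run one callback,
      // remove the node again. If the callback throws, remove_node() is
      // never reached and the node stays registered.
      template<typename Callback>
      void spin_node_once(const std::string & node, Callback && callback) {
        add_node(node);
        callback();        // a throwing user callback skips the next line
        remove_node(node);
      }

      void add_node(const std::string & node) {
        if (!added_.insert(node).second) {
          throw std::runtime_error("Node has already been added to an executor");
        }
      }

      void remove_node(const std::string & node) { added_.erase(node); }

    private:
      std::set<std::string> added_;
    };

    int main() {
      MockExecutor executor;
      int failures = 0;
      for (int cycle = 0; cycle < 3; ++cycle) {
        try {
          executor.spin_node_once("ros2_node", [] {
            throw std::runtime_error("user callback failed");
          });
        } catch (const std::exception & e) {
          ++failures;
          std::cout << "cycle " << cycle << ": " << e.what() << "\n";
        }
      }
      // Only the first exception comes from the user callback; every later
      // cycle fails in add_node() because the node was never removed.
      return failures == 3 ? 0 : 1;
    }
    ```

    Cycle 0 logs the user callback's error; cycles 1 and 2 log the "already been added" error, which matches the endless logging loop I observed.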

    The proposed patch would fix that, I think, by removing the node from the executor before the exception is rethrown to be handled in main() or wherever else spin_once() has been called from. I have not actually tested it yet by compiling rclcpp from source. I also may have missed other places where add_node() and remove_node() get called in pairs. Maybe a better design would involve a RAII-style class that adds a node in its constructor and removes it again in its destructor? Seems like RCPPUTILS_SCOPE_EXIT() is meant exactly for those use cases and should be applied instead of my try/catch block, but I only discovered it while writing this.
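    The RAII idea can be sketched generically. The guard below is only an illustration of what a scope-exit helper in the spirit of RCPPUTILS_SCOPE_EXIT() does; it is not the rcpputils implementation, and the add_node()/remove_node() stand-ins are plain booleans:

    ```cpp
    #include <functional>
    #include <iostream>
    #include <stdexcept>
    #include <utility>

    // Generic scope guard: runs a cleanup function when the guard goes out of
    // scope, whether the scope is left normally or via an exception.
    class ScopeExit {
    public:
      explicit ScopeExit(std::function<void()> cleanup)
      : cleanup_(std::move(cleanup)) {}
      ~ScopeExit() { cleanup_(); }
      ScopeExit(const ScopeExit &) = delete;
      ScopeExit & operator=(const ScopeExit &) = delete;

    private:
      std::function<void()> cleanup_;
    };

    int main() {
      bool node_added = false;
      try {
        node_added = true;                             // stands in for add_node()
        ScopeExit guard([&] { node_added = false; });  // stands in for remove_node()
        throw std::runtime_error("user callback failed");
      } catch (const std::exception & e) {
        std::cout << "caught: " << e.what() << "\n";
      }
      // The guard's destructor ran during stack unwinding, so the "node" was
      // removed even though the exception skipped the normal exit path.
      std::cout << "node_added after unwind: " << std::boolalpha << node_added << "\n";
      return node_added ? 1 : 0;
    }
    ```

    With this pattern the executor would be left in a consistent state no matter where inside the spin the exception originates.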

    The same pattern that involves a loop with rclcpp::ok() and rclcpp::spin_once() directly in main() can be found in many other places, too, e.g. here. I am not sure whether rclpy is also affected, but in the ROS 2 Python examples the equivalent pattern is even dominant.

    For the simpler rclcpp::spin(node) call, an extra loop would need to be added to keep spinning after an exception.
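    That extra loop would look roughly like the sketch below. The ok() and spin() functions here are stand-ins for rclcpp::ok() and rclcpp::spin(node) so the pattern can be shown self-contained; the real calls would be substituted in an actual node:

    ```cpp
    #include <iostream>
    #include <stdexcept>

    // Stand-in for rclcpp::ok(): pretend shutdown is requested after 3 cycles.
    static int remaining_cycles = 3;
    bool ok() { return remaining_cycles-- > 0; }

    // Stand-in for rclcpp::spin(node): throws to simulate an exception
    // escaping from a user callback.
    void spin() { throw std::runtime_error("user callback failed"); }

    int main() {
      // The extra loop: keep spinning until shutdown, logging (instead of
      // terminating on) any exception that escapes spin().
      int logged = 0;
      while (ok()) {
        try {
          spin();  // returns only on shutdown or via an exception
        } catch (const std::exception & e) {
          ++logged;
          std::cout << "exception, continuing: " << e.what() << "\n";
        }
      }
      std::cout << "logged " << logged << " exceptions before shutdown\n";
      return logged == 3 ? 0 : 1;
    }
    ```

    Note that this only works if spin() leaves the executor in a reentrant state after the exception, which is exactly what the patch is about.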

I can hardly believe that there is no intended or documented way to prevent any minor fault from terminating the whole process, or that this behavior is “by design”. I am sorry in case there is something more obvious that I just missed.

It is easy to reproduce the crash with the minimal_service example in ros2/examples, by adding a throw statement in the callback:

$ ros2 run examples_rclcpp_minimal_service service_main &
[1] 353822
$ ros2 service call /add_two_ints example_interfaces/srv/AddTwoInts "{}"
requester: making request: example_interfaces.srv.AddTwoInts_Request(a=0, b=0)

[INFO] [1663789664.837616992] [minimal_service]: request: 0 + 0
terminate called after throwing an instance of 'std::runtime_error'
  what():  some error
^C[1]+  Exit 250                ros2 run examples_rclcpp_minimal_service service_main
$ 

@johannesmeyer thanks for opening the discussion here.

i might be missing something, but here are my comments.

User callbacks must never throw?

I think user callbacks can throw exceptions, and the application is expected to handle them around spin.

Even rclcpp code itself may throw exceptions in the Executor code path while spinning

Yes, it does throw exceptions.

Always catch exceptions when spinning?

If needed, the application needs to use a try statement to catch the exception. That is my understanding.

The problem is that it triggers the “Node has already been added to an executor” exception here in the next cycle after the exception, and hence keeps logging in a loop.

Is this because the application tries to add the node to the executor in the while loop?

The proposed patch would fix that

The patch makes sense to me.

I can almost not believe that there is no foreseen or documented way to prevent that any minor fault terminates the whole process, or that this behavior is “by design”?

I am not sure it was originally designed that way, sorry.

But it would be nice to document this behavior, if that is missing.

It is easy to reproduce the crash with the minimal_service example in ros2/examples, by adding a throw statement in the callback

Yes, correct.

I think it would be suitable to add a try statement to the examples to catch the exception,

because users are likely to copy & paste the example code somewhere else.

best,

Tomoya


From my perspective, I think that there is no one correct answer, and it depends on your application.

For instance, I can easily imagine a high-assurance system where there is a health manager running in the system, and nodes that do the work. In that case, if one of the internals of a node throws, you want the process to exit and crash, since the health manager will notice this, log it, and restart the node.

On the other hand, I can also imagine a system where you just want the components to keep going for as long as you can, so it would be best if exceptions were caught in-process, handled, and the node continued on. The issue here is that this is highly node-specific; if the exception is thrown in certain places, it could leave the node in an inconsistent state, and so continuing on may do more harm than good.

So it is not clear to me that rclcpp and the executors should do anything about this. They don’t have enough knowledge of the node or the system to decide whether it is safe to continue.

That said, I’d like to hear what others have to say about this, particularly those in production. How does everyone handle this?


@clalancette I fully agree, and it should remain a user choice and not implied by the client library. So the question is more about what is a good default behavior, and what patterns should be propagated by examples.

It is not necessary that rclcpp or other client libraries handle all exceptions internally, as long as this is documented and the user has a chance to catch exceptions in main(), where typically rclcpp::spin() or its variants are called, and recover after having handled it. At the moment, without something along the lines of my proposed patch in Exception handling in user callbacks? by meyerj · Pull Request #2017 · ros2/rclcpp · GitHub, that is not possible because the executor is left in an inconsistent state. Independent of whatever other solutions this discussion may lead to, that is imho a bug because the pre- and post-conditions of calling spin() or spin_once() are not well defined then.

Another argument for always handling exceptions within the application by default is that most applications have some kind of internal state, or at least publisher and subscriber queues or latched topics. Having a complex application restarted by an outside health manager is hence disruptive, and may lead to missing messages or failed service calls elsewhere, even though those could be unrelated to the exception that occurred. For an application that runs multiple nodes with different executors, the nodes should influence each other as little as possible, no? Automatic respawning is also not the default behavior of launch, so in many current deployments crashes will probably be left unhandled and require manual intervention.

For the special case of the ros1_bridge I cannot imagine a use case where it would be desirable to let the node die and hence disrupt the flow of other topics and service calls when a ROS 1 service call fails, or when the middleware throws an exception because of a transport error. If there is no way to return exceptions and other errors as such to the service caller in ROS 2, besides adding that explicitly to the response, then the best option is to log the failure, to return no response to the ROS 2 caller and to always call services with a timeout?