Deferrable + Cancellable Lifecycle Transitions

Hi everyone,

At iRobot, we have been working on lifecycle node transitions and would like community input on designing more flexible transition functions (e.g., on_configure). Namely, transitions that can:

  1. be deferred (i.e., async)
  2. be cancelled

Core GitHub issue describing this: rclcpp#2213

What is the current behavior / limitations of lifecycle node transitions?

Currently, lifecycle transition functions must return a CallbackReturn, which makes them synchronous (i.e., we block the executor thread until we return). This poses 2 specific issues:

  1. we cannot call services within an ongoing service without hacky workarounds (rclcpp#773 ; rclcpp#2057)
  2. we have no way of interacting with a node externally when it is in a transition.

How to handle dependencies/interoperability in lifecycle transitions?

I’ll use examples in this section to demonstrate the limitations/needs:

1. Deferral / Async Transition Need:

In the above example, we have:

  • (1) An external SupervisorNode that requests a change state on our LifecycleNode.
  • (2) This is then serviced by the executor, eventually calling the CallbackReturn on_configure(State) function.
  • (3) The LifecycleNode’s on_configure requires calling an external service - in this case a GetParameter where we want the battery_thresh.
  • (4) Our BatteryNode responds, putting the response in the EventsQueue for the executor to process when ready. However, the on_configure is still being serviced, waiting on this response to be processed. Therefore we are deadlocked.

A proposed alternative would be to pass some handle to the user. They can do what they wish (call services / spawn other threads etc) and send a response using the handle. Code examples for reference: rclcpp#2214.
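
To make the deferral idea concrete, here is a minimal sketch in plain C++ with no ROS dependency. ToyExecutor, DeferredResponse, and demo_deferred_configure are hypothetical names invented for illustration; they are not the rclcpp API or the code in rclcpp#2214. The sketch assumes a single-threaded executor that runs queued work in order, and shows that on_configure can return immediately while the parameter response, processed later by the same executor, completes the transition through the handle.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for the CallbackReturn values named in the post.
enum class CallbackReturn { SUCCESS, FAILURE, ERROR };

// Hypothetical single-threaded executor: runs queued work items in order.
class ToyExecutor {
public:
  void post(std::function<void()> work) { queue_.push_back(std::move(work)); }
  void spin_some() {
    while (!queue_.empty()) {
      auto work = std::move(queue_.front());
      queue_.pop_front();
      work();  // work may post more items; they run on later iterations
    }
  }
private:
  std::deque<std::function<void()>> queue_;
};

// Hypothetical handle that lets user code answer the transition later.
class DeferredResponse {
public:
  explicit DeferredResponse(std::function<void(CallbackReturn)> on_done)
  : on_done_(std::move(on_done)) {}
  void send(CallbackReturn result) {
    if (!on_done_) { return; }  // already answered
    auto callback = std::move(on_done_);
    on_done_ = nullptr;
    callback(result);
  }
private:
  std::function<void(CallbackReturn)> on_done_;
};

// Replays steps (1) to (4) from the example without blocking the executor.
std::vector<std::string> demo_deferred_configure() {
  std::vector<std::string> log;
  ToyExecutor exec;
  // (1) + (2): the change-state request reaches on_configure.
  exec.post([&exec, &log] {
    log.push_back("on_configure: started");
    auto handle = std::make_shared<DeferredResponse>([&log](CallbackReturn r) {
      log.push_back(r == CallbackReturn::SUCCESS ? "transition: SUCCESS"
                                                 : "transition: not SUCCESS");
    });
    // (3) + (4): the battery_thresh reply arrives as separate queued work.
    exec.post([&log, handle] {
      log.push_back("battery_thresh response received");
      handle->send(CallbackReturn::SUCCESS);
    });
    // Return instead of waiting, so the executor is free to run the reply.
    log.push_back("on_configure: returned, response deferred");
  });
  exec.spin_some();
  return log;
}
```

If the callback instead waited for the reply before returning, the reply would sit in the queue behind it forever, which is the deadlock described in step (4).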

2. Cancel Need:


In the above example, we have:

  • (1) A SupervisorNode that monitors the state of the overall system, bringing nodes up/down depending on their dependencies. It sees the dependency LCConsumer has on LCProducer and checks that LCProducer is ACTIVE.
  • (2) This is true.
  • (3) Therefore it tries to transition the LCConsumer to ACTIVE.
  • (4) As the transition starts for LCConsumer, LCProducer raises an error (ros2_design#283) and attempts recovery by running on_error(State) callback.
  • (5) LCConsumer is now stuck in ACTIVATING as it depends on an LCProducer service. We have no way to transition out of ACTIVATING from our SupervisorNode’s point of view.

An ideal solution here would be to allow these transitions to be cancellable. The user transition code (in the above example, LCConsumer::on_activate(State&, DeferResponseHandle*)) would be responsible for monitoring for cancels. If the user acknowledges the cancel request, they are responsible for cleaning up the ongoing transition and responding (i.e., a cooperative cancellation approach). Exact behavior can be found in the “More Detailed Expected Behaviors” section below. Code examples for reference: rclcpp#2214.
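
A rough sketch of what cooperative cancellation could look like from the user's side, again in plain C++ with no ROS dependency. ToyChangeStateHandle, run_activation, the step names, and the cancel_before_step parameter are all hypothetical and exist only to simulate an external cancel request arriving mid-transition; the method names is_cancelling and handled_cancelled are borrowed from the behaviors listed later in this post.

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical handle, loosely modelled on the ChangeStateHandler idea.
class ToyChangeStateHandle {
public:
  void request_cancel() { cancelling_.store(true); }         // backend side
  bool is_cancelling() const { return cancelling_.load(); }  // polled by user code
  void handled_cancelled(bool ok) {
    cancel_handled_ = true;
    cancel_ok_ = ok;
  }
  bool cancel_handled() const { return cancel_handled_; }
  bool cancel_ok() const { return cancel_ok_; }
private:
  std::atomic<bool> cancelling_{false};
  bool cancel_handled_{false};
  bool cancel_ok_{false};
};

// User-side activation split into steps, checking for a cancel between them.
// cancel_before_step simulates when the cancel request arrives (-1 = never).
// Returns the steps that are still in effect when the function exits.
std::vector<std::string> run_activation(ToyChangeStateHandle & handle, int cancel_before_step) {
  const std::vector<std::string> steps = {
    "connect_to_producer", "allocate_buffers", "start_timers"};
  std::vector<std::string> done;
  for (int i = 0; i < static_cast<int>(steps.size()); ++i) {
    if (i == cancel_before_step) {
      handle.request_cancel();  // simulated external cancel request
    }
    if (handle.is_cancelling()) {
      // Unwind completed steps in reverse order; only user code knows how.
      while (!done.empty()) {
        done.pop_back();
      }
      handle.handled_cancelled(true);
      return done;
    }
    done.push_back(steps[i]);
  }
  return done;
}
```

The key design point is that the backend only sets a flag; the user code chooses its own safe points to check it and is the only party that unwinds.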

Should lifecycle transitions be Actions?

You may notice these deferrable + cancellable transition function needs are entirely encompassed under ROS 2 Actions. In our current proposed approach (see “Current Proposal” section below), we essentially recreate a GoalHandler naming it a ChangeStateHandler. Further, the change_state process already publishes events on change (i.e., published feedback).

We think ideally we could replace the current ChangeState.srv and corresponding transition functions to be a ChangeState.action instead. However, this would fully break all backward compatibility across all lifecycle work (e.g., rclcpp::lifecycle, rcl::lifecycle, rclpy::lifecycle …). A possible solution we are thinking about is some form of tick-tock deprecation pattern where both co-exist but this is still very far reaching.

On a related note (but much broader than this post / may be worth discussing in another post), we have talked quite a bit about how ROS 2 Actions possibly should be thought of as Asynchronous Services that can be optionally cancelled. We have found (especially in this implementation of async lifecycle transitions) that using Async Services (introduced in rclcpp#1709) leads down a path of wanting the remaining components already implemented in an Action (accept / reject requests, cancellable, feedback on completion …).

Current Proposal

Our current approach:

  1. re-organizes rclcpp lifecycle code to better fit a model-view-controller paradigm. Note this has been separated out into its own respective issue (rclcpp#2212) and PR (rclcpp#2211) as it is a large architectural change
  2. is fully backward compatible by adding a ChangeStateHandler that allows for response deferral + cancellation monitoring with issue (rclcpp#2213) and PRs (rclcpp#2214 , rcl_interfaces#157)

These are rather large code base changes, to rclcpp lifecycle in particular; therefore we would like community feedback before a more thorough review.

More Detailed Expected Behaviors

To be as concrete as possible, this section goes into the finer-grained details of the exact expected behavior for a deferrable + cancellable transition.

The “user” refers to the user-specific Lifecycle node code (e.g., the user’s on_configure implementation). The “Lifecycle backend” refers to the underlying rclcpp/rcl implementation. Finally, some of these behaviors are described w.r.t. ChangeStateHandler/our current proposed implementation for convenience. Higher-level design language would be ideal for future design documents.

Deferral

  • User is passed a shared_ptr or equivalent handle (e.g., shared_ptr<ChangeStateHandler>) with which they can send a CallbackReturn response when they wish (could be immediately as before or defer until later)
  • When calling an async transition callback, the Lifecycle backend relinquishes control of the executor thread to the transition callback and does not expect it back until response (must wait until user sends a change_state_hdl->send_callback_resp(CallbackReturn) or handles a cancel request)
  • The handle is only valid for 1 send_callback_resp; when send_callback_resp is called, the handle is subsequently invalidated.
  • A user can check for a valid handler atomically (e.g., change_state_hdl->is_executing())
  • By default, Lifecycle transition callbacks remain synchronous, requiring a register_async_on_X(function) to override the default synchronous function
  • Only 1 transition function can be registered per transition state callback at any given time (see rclcpp#2216)
  • The client of a ChangeState.srv will receive a success = true upon the successful completion.
  • “Successful completion” is defined by the underlying state machine being updated to a primary state and the change event being published. Note it does not refer to CallbackReturn::SUCCESS, just a full completion which can be made up of CallbackReturn::FAILURE/ERROR as well.
  • At most 1 transition request can be processed at any given time on a first come, first serve basis. All transition requests made while a transition is ongoing will immediately be responded to with success = false with an error message indicating as such (see rclcpp#2154)
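
Several of the deferral rules above (single-use handle, atomic is_executing check, rejection of concurrent requests) can be sketched as a small plain C++ model. ToyLifecycleBackend, ToyChangeStateHandler, ChangeStateReply, and the error text are hypothetical; only the method names send_callback_resp and is_executing come from this post, and their signatures here are assumptions rather than the proposed rclcpp interface.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <string>

enum class CallbackReturn { SUCCESS, FAILURE, ERROR };

// Toy stand-in for the ChangeState.srv response fields discussed above.
struct ChangeStateReply {
  bool success;
  std::string error_msg;
};

class ToyLifecycleBackend;

// Hypothetical single-use handle passed to the user's transition callback.
class ToyChangeStateHandler {
public:
  explicit ToyChangeStateHandler(ToyLifecycleBackend & backend) : backend_(backend) {}
  bool is_executing() const { return executing_.load(); }
  // Returns true only for the first call; later calls are ignored.
  bool send_callback_resp(CallbackReturn result);
private:
  ToyLifecycleBackend & backend_;
  std::atomic<bool> executing_{true};
};

// Hypothetical backend that allows at most one transition in flight.
class ToyLifecycleBackend {
public:
  // Returns nullptr and fills the reply if a transition is already ongoing.
  std::shared_ptr<ToyChangeStateHandler> request_transition(ChangeStateReply & reply) {
    if (busy_) {
      reply = {false, "transition already in progress"};
      return nullptr;
    }
    busy_ = true;
    return std::make_shared<ToyChangeStateHandler>(*this);
  }
  void complete(CallbackReturn result) {
    last_result_ = result;
    busy_ = false;
    ++completed_;
  }
  int completed() const { return completed_; }
  CallbackReturn last_result() const { return last_result_; }
private:
  bool busy_{false};
  int completed_{0};
  CallbackReturn last_result_{CallbackReturn::SUCCESS};
};

bool ToyChangeStateHandler::send_callback_resp(CallbackReturn result) {
  // exchange() lets exactly one caller observe true, invalidating the handle.
  if (!executing_.exchange(false)) {
    return false;
  }
  backend_.complete(result);
  return true;
}
```

Note that the transition counts as completed even when the user sends FAILURE, matching the definition of “successful completion” above.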

Cancellation

  • A new CancelTransition.srv (or equivalent) & respective service allows for external node requests to cancel an ongoing transition
  • A user can check for a cancel request atomically (e.g., via change_state_hdl->is_cancelling())
  • The handler implements a cooperative cancellation policy
    • It is up to the user to monitor for and unwind a cancelled request; this is because the user is the only one who knows how to unwind at their defined points within a transition
    • A user can ignore a cancellation request completely:
      • A successfully completed transition request supersedes an ongoing cancel request with the state machine being updated according to the completed transition response.
      • If a user decides to ignore the cancellation request and subsequently successfully completes a transition, the cancel requester will be responded to with success = false and given an error reason.
  • A user can respond to a cancelled request (e.g., via handled_cancelled(bool)) indicating a successful handle or not
    • Upon a change_state_hdl->handled_cancelled(true), the lifecycle node will follow the CallbackReturn::FAILURE path; this keeps the same valid state machine while often “falling back” to the prior state
    • Upon a change_state_hdl->handled_cancelled(false), the lifecycle node will follow the CallbackReturn::ERROR path
  • A CancelTransition.srv requires a request field indicating the desired transition to be cancelled. This is to avoid race conditions / follows RESTful concurrent PUT request design. Note this is not in the current proposed implementation but is planned to be added.
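
The cancellation rules above boil down to a small decision table, sketched here in plain C++. UserAction, Outcome, resolve, and cancel_targets_ongoing are hypothetical names. Two points are assumptions not stated in the post: that the cancel requester receives success = true after handled_cancelled(true), and success = false after handled_cancelled(false). The integer transition IDs are also only an illustration of the planned request field.

```cpp
#include <cassert>

enum class CallbackReturn { SUCCESS, FAILURE, ERROR };

// What the user code did with the handle while a cancel request was pending.
enum class UserAction {
  Completed,            // ignored the cancel and finished the transition
  CancelHandledOk,      // called handled_cancelled(true)
  CancelHandledFailed   // called handled_cancelled(false)
};

struct Outcome {
  CallbackReturn path;        // path followed through the state machine
  bool cancel_reply_success;  // success field returned to the cancel requester
};

// user_result is only used when the user completed the transition normally.
Outcome resolve(UserAction action, CallbackReturn user_result) {
  switch (action) {
    case UserAction::CancelHandledOk:
      // Cancel unwound cleanly: FAILURE path (assumed success = true reply).
      return {CallbackReturn::FAILURE, true};
    case UserAction::CancelHandledFailed:
      // Unwinding failed: ERROR path (assumed success = false reply).
      return {CallbackReturn::ERROR, false};
    case UserAction::Completed:
      // A completed transition supersedes the pending cancel request.
      return {user_result, false};
  }
  return {CallbackReturn::ERROR, false};
}

// A cancel request only applies if it names the transition currently running.
bool cancel_targets_ongoing(bool transition_in_progress,
                            int ongoing_transition_id,
                            int requested_transition_id) {
  return transition_in_progress && ongoing_transition_id == requested_transition_id;
}
```

Requiring the transition ID in the cancel request guards against a late cancel hitting a different transition that started after the intended one finished.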

Haven’t had the time to digest everything here yet, but I’d like to add to the discussion that it’s very possible to call services within other service calls or timers by creating another executor to handle the work. As an example, see how nav2 has implemented their lifecycle manager.

The separate thread is not necessary. I have used the sub-executor both for asynchronous and blocking calls (i.e., calling spin_until_future_complete on the sub-executor). As mentioned by wjwwood in one of the linked issues, coroutines may be an elegant solution to the problem in the future (C++20).

Yes, there are a few ways to call a service within a service, as you mention. You can spin up another internal executor, although I would personally put this into the “hacky workaround” bin. Not that the workaround is bad, but more so that it should not be necessary for a common functional need of calling a service from within a service.

I think a good point @oysstu brings up here is that there should exist something that allows you to easily throw callback work onto the executor (I’m also a big fan of coroutines, coming more from the games world; another rclcpp discussion on coroutines that I found while working on this: rclcpp#1533). I think this would be a great issue to bring up / feature to add (although a bit off topic for the core of this post in my opinion).

One of the core ideas and motivations of this post is to allow for cancellation requests from external nodes while mid-transition. This also extends toward the idea we want other services available while a transition is ongoing (e.g., GetState.srv).

This is not possible (although I am sure there is a very hacky workaround) by spinning up a secondary executor, as your node’s main executor cannot receive/process any requests while it is still servicing the ChangeState callback itself (unless the node is on a multithreaded executor, which shouldn’t be a requirement).