Hi everyone,
At iRobot, we have been working on lifecycle node transitions and would like community input on designing more flexible transition functions (e.g., `on_configure`). Namely, transitions that can:
- be deferred (i.e., async)
- be cancelled
Core GitHub issue describing this: rclcpp#2213
What is the current behavior / limitations of lifecycle node transitions?
Currently, lifecycle transition functions must return a `CallbackReturn`, which makes them synchronous (i.e., we block the executor thread until we return). This poses two specific issues:
- we cannot call services within an ongoing service without hacky workarounds (rclcpp#773 ; rclcpp#2057)
- we have no way of interacting with a node externally when it is in a transition.
How to handle dependencies/interoperability in lifecycle transitions?
I’ll use examples in this section to demonstrate the limitations/needs:
1. Deferral / Async Transition Need:
In the above example, we have:
- (1) An external `SupervisorNode` that requests a change state on our `LifecycleNode`.
- (2) This is then serviced by the executor, eventually calling the `CallbackReturn on_configure(State)` function.
- (3) The `LifecycleNode`’s `on_configure` requires calling an external service, in this case a `GetParameter` where we want the `battery_thresh`.
- (4) Our `BatteryNode` responds, putting the response in the `EventsQueue` for the executor to process when ready. However, the `on_configure` is still being serviced, waiting on this response to be processed. Therefore we are deadlocked.
A proposed alternative would be to pass some handle to the user. They can do what they wish (call services / spawn other threads etc) and send a response using the handle. Code examples for reference: rclcpp#2214.
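To make the handle idea concrete, here is a minimal, ROS-free sketch of a deferred `on_configure`. Everything in it (`EventQueue`, `DeferredHandle`, `on_configure_async`, the `0.2` threshold) is a hypothetical stand-in and not the API proposed in rclcpp#2214. It only illustrates that when the callback returns without waiting, the single executor thread is free to process the reply.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <queue>
#include <utility>

// Hypothetical stand-ins for illustration; none of these are rclcpp types.
enum class CallbackReturn { SUCCESS, FAILURE, ERROR };

// Single-threaded "executor": runs one queued event per spin_once() call.
class EventQueue {
public:
  void push(std::function<void()> event) { events_.push(std::move(event)); }

  // Runs the oldest event, if any. Returns false when the queue is empty.
  bool spin_once() {
    if (events_.empty()) { return false; }
    auto event = std::move(events_.front());
    events_.pop();  // pop first so the event may safely push new events
    event();
    return true;
  }

private:
  std::queue<std::function<void()>> events_;
};

// Handle the user keeps in order to report the transition result later.
class DeferredHandle {
public:
  void send(CallbackReturn result) { result_ = result; done_ = true; }
  bool done() const { return done_; }
  CallbackReturn result() const { return result_; }

private:
  bool done_{false};
  CallbackReturn result_{CallbackReturn::ERROR};
};

// Async-style on_configure: issues the request and returns without waiting.
void on_configure_async(
  EventQueue & queue, std::shared_ptr<DeferredHandle> handle, double & battery_thresh)
{
  // Simulate the BatteryNode reply landing in the queue as a later event.
  queue.push([handle, &battery_thresh]() {
    battery_thresh = 0.2;  // made-up value for the sketch
    handle->send(CallbackReturn::SUCCESS);
  });
}

// Returns true if the result only becomes available after the reply event runs.
bool run_deferred_configure_demo() {
  EventQueue queue;
  auto handle = std::make_shared<DeferredHandle>();
  double battery_thresh = 0.0;

  queue.push([&]() { on_configure_async(queue, handle, battery_thresh); });

  queue.spin_once();  // runs on_configure_async; it returns immediately
  const bool still_pending = !handle->done();

  queue.spin_once();  // runs the simulated reply, which completes the transition
  const bool completed = handle->done() && handle->result() == CallbackReturn::SUCCESS;

  return still_pending && completed && battery_thresh > 0.0;
}
```

In the synchronous version, the first event would have to wait inside the callback for a reply that can only be delivered by the same thread, which is the deadlock described above.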
2. Cancel Need:
In the above example, we have:
- (1) A `SupervisorNode` that monitors the state of the overall system, bringing nodes up/down depending on their dependencies. It sees the dependency `LCConsumer` has on `LCProducer` and checks that `LCProducer` is `ACTIVE`.
- (2) This is true.
- (3) Therefore it tries to transition the `LCConsumer` to `ACTIVE`.
- (4) As the transition starts for `LCConsumer`, `LCProducer` raises an error (ros2_design#283) and attempts recovery by running the `on_error(State)` callback.
- (5) `LCConsumer` is now stuck in `ACTIVATING` as it depends on a `LCProducer` service. We have no way to transition out of `ACTIVATING` from our `SupervisorNode`’s point of view.
An ideal solution here would be to allow these transitions to be cancellable. The user transition code (in the above example it would be `LCConsumer::on_activate(State&, DeferResponseHandle*)`) would be responsible for monitoring for cancels. If the user acknowledges the cancel request, they are responsible for cleaning up the ongoing transition and responding (i.e., a cooperative cancellation approach). Exact behavior can be found in the “More Detailed Expected Behaviors” section below. Code examples for reference: rclcpp#2214.
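As a rough illustration of cooperative cancellation, the sketch below has the user's transition code check a cancel flag between steps and unwind on its own terms. The names (`CancelFlag`, `activate_cooperatively`, the step count) are hypothetical and ROS-free; the real proposal lives in rclcpp#2214.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical outcome of a user transition in this sketch.
enum class Outcome { COMPLETED, CANCELLED };

// Flag the backend would set when a cancel request arrives.
class CancelFlag {
public:
  void request_cancel() { cancel_requested_.store(true); }
  bool is_cancelling() const { return cancel_requested_.load(); }

private:
  std::atomic<bool> cancel_requested_{false};
};

// User transition code: acquires one resource per step and checks for a
// cancel request before each step. Only the user knows how to unwind.
Outcome activate_cooperatively(
  const CancelFlag & flag, int total_steps, std::vector<int> & acquired,
  const std::function<void(int)> & after_step)
{
  for (int step = 0; step < total_steps; ++step) {
    if (flag.is_cancelling()) {
      acquired.clear();  // user-defined unwind of everything acquired so far
      return Outcome::CANCELLED;
    }
    acquired.push_back(step);
    after_step(step);  // hook used here to inject a cancel mid-transition
  }
  return Outcome::COMPLETED;
}

// A cancel arrives after step 1 of 4: the transition unwinds and stops.
bool run_cancel_demo() {
  CancelFlag flag;
  std::vector<int> acquired;
  const Outcome outcome = activate_cooperatively(flag, 4, acquired, [&flag](int step) {
    if (step == 1) { flag.request_cancel(); }
  });
  return outcome == Outcome::CANCELLED && acquired.empty();
}

// No cancel arrives: all four steps run to completion.
bool run_no_cancel_demo() {
  CancelFlag flag;
  std::vector<int> acquired;
  const Outcome outcome = activate_cooperatively(flag, 4, acquired, [](int) {});
  return outcome == Outcome::COMPLETED && acquired.size() == 4;
}
```

Note that a cancel arriving after the last step is simply never observed, so the transition completes normally. This matches the idea that the user decides where cancellation points are.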
Should lifecycle transitions be Actions?
You may notice these deferrable + cancellable transition needs are entirely encompassed by ROS 2 `Action`s. In our current proposed approach (see the “Current Proposal” section below), we essentially recreate a `GoalHandler`, naming it a `ChangeStateHandler`. Further, the `change_state` process already publishes events on change (i.e., published feedback).
We think that, ideally, we could replace the current `ChangeState.srv` and the corresponding transition functions with a `ChangeState.action`. However, this would fully break backward compatibility across all lifecycle work (e.g., `rclcpp::lifecycle`, `rcl::lifecycle`, `rclpy::lifecycle`…). A possible solution we are thinking about is some form of tick-tock deprecation pattern where both co-exist, but this is still very far-reaching.
On a related note (but much broader than this post / may be worth discussing in another post), we have talked quite a bit about how ROS 2 `Action`s possibly should be thought of as `Asynchronous Services` that can be optionally cancelled. We have found (especially in this implementation of async lifecycle transitions) that using `Async Services` (introduced in rclcpp#1709) leads down a path of wanting the remaining components already implemented in an `Action` (accept / reject requests, cancellable, feedback on completion …).
Current Proposal
Our current approach:
- re-organizes `rclcpp` lifecycle code to better fit a model-view-controller paradigm. Note this has been separated out into its own respective issue (rclcpp#2212) and PR (rclcpp#2211) as it is a large architecture change
- is fully backward compatible by adding a `ChangeStateHandler` that allows for response deferral + cancellation monitoring with issue (rclcpp#2213) and PRs (rclcpp#2214, rcl_interfaces#157)
These are rather large code base changes, to `rclcpp` lifecycle in particular, so we would like community feedback before a more thorough review.
More Detailed Expected Behaviors
To be as concrete as possible, this section goes into the finer-grained details of the exact expected behavior for a deferrable + cancellable transition.
The “user” refers to the user-specific lifecycle node code (e.g., the user’s `on_configure` implementation). The “`Lifecycle` backend” refers to the underlying `rclcpp`/`rcl` implementation. Finally, some of these behaviors are described w.r.t. `ChangeStateHandler`/our current proposed implementation for convenience. Higher-level design language would be ideal for future design documents.
Deferral
- User is passed a `shared_ptr` or equivalent handle (e.g., `shared_ptr<ChangeStateHandler>`) with which they can send a `CallbackReturn` response when they wish (could be immediately as before or defer until later)
- When calling an async transition callback, the `Lifecycle` backend relinquishes control of the executor thread to the transition callback and does not expect it back until response (must wait until user sends a `change_state_hdl->send_callback_resp(CallbackReturn)` or handles a cancel request)
- The handle is only valid for 1 `send_callback_resp`; when `send_callback_resp` is called, the handle is subsequently invalidated.
- A user can check for a valid handler atomically (e.g., `change_state_hdl->is_executing()`)
- By default, Lifecycle transition callbacks remain synchronous, requiring a `register_async_on_X(function)` to override the default synchronous function
- Only 1 transition function can be registered per transition state callback at any given time (see rclcpp#2216)
- The client of a `ChangeState.srv` will receive a `success = true` upon the successful completion.
  - “Successful completion” is defined by the underlying state machine being updated to a primary state and the change event being published. Note it does not refer to `CallbackReturn::SUCCESS`, just a full completion which can be made up of `CallbackReturn::FAILURE/ERROR` as well.
- At most 1 transition request can be processed at any given time on a first come, first serve basis. All transition requests made while a transition is ongoing will immediately be responded to with `success = false` with an error message indicating as such (see rclcpp#2154)
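Two of the deferral rules above (a handle is valid for exactly one response, and only one transition is processed at a time) can be sketched without ROS as follows. The method names `send_callback_resp` and `is_executing` are taken from the list; the classes `OneShotHandle` and `TransitionGate` and their internals are assumptions made for this illustration, not the code in rclcpp#2214.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical stand-in; not the rclcpp type.
enum class CallbackReturn { SUCCESS, FAILURE, ERROR };

// Handle that accepts exactly one response and is invalid afterwards.
class OneShotHandle {
public:
  explicit OneShotHandle(std::function<void(CallbackReturn)> on_response)
  : on_response_(std::move(on_response)) {}

  // Atomic validity check.
  bool is_executing() const { return executing_.load(); }

  // Delivers the response once; any later call is rejected.
  bool send_callback_resp(CallbackReturn result) {
    if (!executing_.exchange(false)) { return false; }  // already used
    on_response_(result);
    return true;
  }

private:
  std::function<void(CallbackReturn)> on_response_;
  std::atomic<bool> executing_{true};
};

// Admits one transition at a time, first come, first served.
// The gate must outlive the handles it creates (they capture `this`).
class TransitionGate {
public:
  // Returns a handle when accepted, or nullptr while a transition is ongoing.
  std::shared_ptr<OneShotHandle> request_transition() {
    if (busy_.exchange(true)) { return nullptr; }
    return std::make_shared<OneShotHandle>([this](CallbackReturn result) {
      last_result_ = result;
      busy_.store(false);  // the next request may now be accepted
    });
  }

  CallbackReturn last_result() const { return last_result_; }

private:
  std::atomic<bool> busy_{false};
  CallbackReturn last_result_{CallbackReturn::SUCCESS};
};

// Returns true if the one-response and one-transition rules both hold.
bool run_single_transition_demo() {
  TransitionGate gate;
  auto first = gate.request_transition();
  auto second = gate.request_transition();  // rejected: first is still ongoing
  if (!first || second) { return false; }

  const bool sent = first->send_callback_resp(CallbackReturn::FAILURE);
  const bool resent = first->send_callback_resp(CallbackReturn::SUCCESS);  // rejected
  auto third = gate.request_transition();  // accepted again after completion

  return sent && !resent && !first->is_executing() &&
         gate.last_result() == CallbackReturn::FAILURE && third != nullptr;
}
```

The first response here is `FAILURE` on purpose: the transition still counts as fully completed, which is the distinction the list draws between “successful completion” and `CallbackReturn::SUCCESS`.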
Cancellation
- A new `CancelTransition.srv` (or equivalent) & respective service allows for external node requests to cancel an ongoing transition
- A user can check for a cancel request atomically (e.g., via `change_state_hdl->is_cancelling()`)
- The handler implements a cooperative cancellation policy
  - It is up to the user to monitor for and unwind a cancelled request; this is due to the user being the only one who knows how to unwind at their defined points within a transition
  - A user can ignore a cancellation request completely:
    - A successfully completed transition request supersedes an ongoing cancel request with the state machine being updated according to the completed transition response.
    - If a user decides to ignore the cancellation request and subsequently successfully completes a transition, the cancel requester will be responded to with `success = false` and given an error reason.
  - A user can respond to a cancelled request (e.g., via `handled_cancelled(bool)`) indicating a successful handle or not
    - Upon a `change_state_hdl->handled_cancelled(true)`, the lifecycle node will follow the `CallbackReturn::FAILURE` path; this keeps the same valid state machine while often “falling back” to the prior state
    - Upon a `change_state_hdl->handled_cancelled(false)`, the lifecycle node will follow the `CallbackReturn::ERROR` path
- A `CancelTransition.srv` requires a request field indicating the desired transition to be cancelled. This is to avoid race conditions / follows RESTful concurrent PUT request design. Note this is not in the current proposed implementation but is planned to be added.
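The cancellation rules above can be summarized in a small, ROS-free state sketch. The class `OngoingTransition`, the integer transition id, and the choice to report the cancel as successful when `handled_cancelled(true)` is called are all assumptions for illustration; only the `FAILURE`/`ERROR` paths and the "completed transition supersedes the cancel" rule come from the list.

```cpp
#include <cassert>

// Hypothetical stand-in; not the rclcpp type.
enum class CallbackReturn { SUCCESS, FAILURE, ERROR };

// Tracks one ongoing transition and how a cancel request is resolved.
class OngoingTransition {
public:
  explicit OngoingTransition(int transition_id) : id_(transition_id) {}

  // Registers a cancel only if it names this still-running transition,
  // so a stale request cannot cancel a different transition.
  bool request_cancel(int requested_id) {
    if (finished_ || requested_id != id_) { return false; }
    cancelling_ = true;
    return true;
  }

  bool is_cancelling() const { return cancelling_ && !finished_; }

  // User finished normally (cancel ignored or never seen): the transition
  // result wins and any pending cancel is answered with success = false.
  void send_callback_resp(CallbackReturn result) {
    if (finished_) { return; }
    finished_ = true;
    result_ = result;
    cancel_succeeded_ = false;
  }

  // User acknowledged the cancel: true -> FAILURE path, false -> ERROR path.
  void handled_cancelled(bool unwound_ok) {
    if (finished_ || !cancelling_) { return; }
    finished_ = true;
    result_ = unwound_ok ? CallbackReturn::FAILURE : CallbackReturn::ERROR;
    cancel_succeeded_ = unwound_ok;  // assumption: not specified in the post
  }

  bool finished() const { return finished_; }
  CallbackReturn result() const { return result_; }
  bool cancel_succeeded() const { return cancel_succeeded_; }

private:
  int id_;
  bool cancelling_{false};
  bool finished_{false};
  bool cancel_succeeded_{false};
  CallbackReturn result_{CallbackReturn::ERROR};
};

// Cancel acknowledged and unwound cleanly: FAILURE path.
bool run_handled_cancel_demo() {
  OngoingTransition transition(3);
  const bool wrong_id = transition.request_cancel(4);  // rejected: id mismatch
  const bool right_id = transition.request_cancel(3);
  transition.handled_cancelled(true);
  return !wrong_id && right_id && transition.result() == CallbackReturn::FAILURE &&
         transition.cancel_succeeded();
}

// Cancel ignored: the completed transition supersedes it.
bool run_ignored_cancel_demo() {
  OngoingTransition transition(3);
  transition.request_cancel(3);
  transition.send_callback_resp(CallbackReturn::SUCCESS);
  return transition.result() == CallbackReturn::SUCCESS && !transition.cancel_succeeded();
}

// Cancel acknowledged but unwinding failed: ERROR path.
bool run_failed_unwind_demo() {
  OngoingTransition transition(3);
  transition.request_cancel(3);
  transition.handled_cancelled(false);
  return transition.result() == CallbackReturn::ERROR;
}
```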