Task Error Handling - onSuccess/onFailure (#469)

Posted by @ItamarEliakim-R:

Hey.

What would be the best way to implement task-state error handling as part of a generalized composition?
I am looking for a solution to implement the following flow:

  • go-to-place kitchen → wait-for x sec → pickup coke from table
  • On success: dropoff coke at entrance.
  • On failure: go to charger.

Is it possible to do this in the core so that it applies to all fleets (e.g., in rmf_task and the fleet adapter), rather than as a hack on the vendor side of each fleet adapter?

Thanks!
Itamar


Edited by @ItamarEliakim-R at 2024-05-28T13:34:07Z

Posted by @mxgrey:

Right now unrecoverable errors in task execution are dealt with by cancelling the task. Your fleet adapter can monitor whether the robot is successfully executing its commands and then cancel the task, referencing it by its current task ID.
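As an illustration of that approach, the sketch below builds the JSON payload a fleet adapter could publish to the RMF task API to cancel a task. The payload shape follows my reading of the `cancel_task_request` schema in rmf_api_msgs (typically sent as the `json_msg` field of an `rmf_task_msgs/msg/ApiRequest`); treat the exact field names and topic plumbing as assumptions to verify against your RMF version.

```python
# Hypothetical sketch: building a cancel_task_request payload for the RMF
# task API. Field names follow the rmf_api_msgs cancel_task_request schema
# as I understand it; verify against your installed version.
import json
import uuid


def make_cancel_request(task_id: str, labels=None) -> str:
    """Return the JSON string to place in the json_msg field of an
    rmf_task_msgs/msg/ApiRequest in order to cancel the given task."""
    payload = {
        "type": "cancel_task_request",
        "task_id": task_id,
        # Optional labels explaining why the cancellation was requested.
        "labels": labels or ["fleet-adapter: robot failed to execute command"],
    }
    return json.dumps(payload)


if __name__ == "__main__":
    # Each ApiRequest also carries a unique request_id, e.g.:
    request_id = f"cancel_task.{uuid.uuid4()}"
    print(make_cancel_request("compose.dispatch-001"))
```

A real adapter would publish this on the task API request topic (reliable, transient-local QoS) and watch the corresponding response for confirmation.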

The composed task schema lets you specify, for each phase, the actions to take when the task is cancelled during that phase. You could put your error-handling behavior inside that cancellation sequence.
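For illustration, a composed task description for the flow above might look like the sketch below. The `category`/`phases`/`on_cancel` structure follows the compose task schema in rmf_fleet_adapter, but the waypoint names, the `pickup_coke` action, and the exact shape of each event description are assumptions; check the event schemas shipped with your RMF version before using this.

```json
{
  "category": "compose",
  "description": {
    "category": "coke_delivery",
    "phases": [
      { "activity": { "category": "go_to_place",
                      "description": "kitchen" } },
      { "activity": { "category": "wait_for",
                      "description": { "unix_millis_waiting_duration": 10000 } } },
      { "activity": { "category": "perform_action",
                      "description": {
                        "unix_millis_action_duration_estimate": 60000,
                        "category": "pickup_coke",
                        "description": {} } },
        "on_cancel": [
          { "category": "go_to_place", "description": "charger" }
        ] },
      { "activity": { "category": "go_to_place",
                      "description": "entrance" } }
    ]
  }
}
```

If the task runs to completion, the robot ends at the entrance; if it is cancelled during the pickup phase, the `on_cancel` sequence sends it to the charger instead.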

In your specific example, since the error handling is just to go to the charger, you probably don’t need any special cancellation sequence. If you use the charging finishing request, a simple cancellation will be enough: once the task is cancelled, the robot will head to its charger on its own.
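For reference, that finishing behavior is typically set in the fleet configuration rather than per task. A hedged sketch (key names and placement vary between RMF versions and adapters, so treat this as an assumption to check against your config schema):

```yaml
# Sketch of a fleet configuration fragment (key placement is an assumption).
# With finishing_request set to "charge", every completed or cancelled task
# ends with the robot returning to its charger.
rmf_fleet:
  name: "my_fleet"
  task_capabilities:
    loop: true
    delivery: true
    finishing_request: "charge"   # common options: park, charge, nothing
```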

We don’t currently support retry loops within the core task execution pipeline. As you’ve already seen, support for more generalized task flows is something we’re actively working on, but it is not available yet.

Posted by @ItamarEliakim-R:

Sounds excellent.
Thanks, @mxgrey! My first guess was that cancelling the task to handle the error was a decent solution for now. I’ll explore the rmf_task core further over the next few days/weeks and think about ways to extend the current task manager and parser to support the success/failure mechanism until the major redesign lands.

Posted by @mxgrey:

In all truth I’m unlikely to accept any major changes to the existing codebase, especially changes to the implementation of task execution, because we’ve found it to be a hotbed for data races and other forms of undefined behavior.

Any slight misstep with how callbacks are triggered can cause data corruption. These mistakes are very difficult to notice even if you know exactly what to look for, and they generally won’t show up in CI because there’s only a 1/10000 chance of them occurring per run.

A subtle mistake that gets introduced in order to support a new feature will likely cause a bug for everyone, even those who are not using the new feature. All this baggage is why we’re trying to aggressively migrate to the new frameworks that provide guaranteed memory safety, even in the presence of massive multi-threading.

You’re always welcome to fork the codebase and experiment. You can also open a PR if you come up with something you’re confident in, and I’ll give genuine consideration to whether to merge it. I just want to be clear upfront that I’ll be extremely risk averse to any substantial implementation changes, especially anything that smells like it could conceivably introduce a memory safety issue, even if I can’t actually prove that there’s a problem. I can’t afford to keep hunting down data races since it comes at the cost of making progress on the long term solutions that will eliminate the threat of data races altogether.


Edited by @mxgrey at 2024-05-29T12:08:41Z

Posted by @ItamarEliakim-R:

Thanks a lot for your clear and straightforward feedback. I completely get where you’re coming from with the potential risks in making major updates, especially with all the complexities involved in task execution and the problems you’ve mentioned.

We’re definitely on the same page about keeping the codebase stable and safe. I’ll make sure to keep any changes minimal, and we will do extensive internal checks/validation before suggesting anything new through PRs. Our goal is to ensure that these changes integrate well with the existing code without causing any disruptions.

Thanks for letting us experiment and for considering any solid PRs we submit. I’ll be extra careful and thoughtful in our design and approach to ensure we contribute positively and align with the broader goals of memory safety and more effective task handling. I’ll understand if the community decides to pass on these new features because they don’t quite fit the new direction or standards :slight_smile:

Posted by @rushelby123:

Thanks to both of you @ItamarEliakim-R and @mxgrey , I have found this discussion very helpful. I am new to the RMF community and I am working on a project related to error handling. I was wondering whether RMF has any integrated mechanism to handle errors or unexpected scenarios.

To be more specific, let me give an example:

Suppose two robots can accomplish a task, and the dispatcher picks one of them to execute it. Now assume that, for some reason, during execution the chosen robot can no longer complete the task.

What I would do as a recovery plan in this case is reschedule the task, accounting for the error, and retry with the other robot before cancelling the task outright. As far as I understand, though, this is not done automatically by RMF, and it’s up to the user to define such mechanisms?

Let me know if there are any updates on this topic; I think it could be interesting for everyone.

Posted by @rushelby123:

I opened a discussion about this matter. I’d really appreciate your help! #686