Why a strict bidding time window system vs a timeout system for task bidding? (#545)

Posted by @dillonloh:

Background

Hello OpenRMF developers,

First of all, thank you for building this amazing piece of software. My team has been testing it extensively for use in large-scale simulations, and it has worked great. However, one thing that has been difficult to work around is the strict bidding time window that we must predefine as a launch parameter. We noticed that Open-RMF’s bidding system is such that even if all connected fleet adapters have submitted their bids to the core, the system does not immediately end the “auction” and instead waits until the full duration of the bidding window has passed. Normally, this probably isn’t an issue, since I assume that the system wants to be flexible enough for late bids to come in.

However, in our use case where we expect >20 tasks to eventually be in the robot’s task queue, this system becomes difficult to work with. In particular, it is difficult for us to define a good bidding time window that balances the need to give sufficient time to fleet adapters for cost calculation, and the need to assign tasks within a reasonable amount of time.

I illustrate this with 2 scenarios below:

Scenario 1
Say we set the bidding time window to a reasonably short value of 10 seconds. When the robot still has short task queues at the start of its lifetime, this works fine, as all bids can come in within seconds. However, as task queues grow longer, robots will not be able to calculate costs quickly enough to submit bids within the window. In the extreme case where this applies to all robots, the task is skipped and never assigned again.

Scenario 2
We set a longer bidding time window, hypothetically long enough (e.g. 10 minutes) that a fleet adapter will always be able to submit a bid before the window expires, regardless of how long the task queues get. However, this means that even when task queues are short and all bids come in within seconds, the auctioneer still waits for the full 10 minutes to pass before actually assigning the task, which is extremely inconvenient.

Question

Given my limited perspective of the above issues, I would greatly appreciate it if I could better understand the reason behind this design decision.
It seems to me that a “loose” timeout system would be a better fit, since it can handle both short and long task queue situations. Specifically, it solves the problem of excessive waiting when task queues are short.
I illustrate this below (assume we set a timeout of 10 minutes).

Short Task Queue
Bids will come in within seconds here. Once all bids come in, since our system is loose, we immediately close the auction. This keeps auction times short when task queues are short and cost calculations are fast.

Long Task Queue
This works the same as before, but now I do not need to worry about how long I set the timeout, since it won’t affect the short task queue situation.
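To make the comparison concrete, here is a minimal sketch of the two closing rules. It is not the Open-RMF Dispatcher implementation: `run_auction`, the `(arrival_s, fleet, cost)` bid tuple, and the `expected_fleets` set are all hypothetical names, and the sketch assumes the auctioneer somehow knows which fleets it is waiting for.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# Hypothetical bid record: (arrival time in seconds, fleet name, cost)
Bid = Tuple[float, str, float]


def run_auction(
    bids: Iterable[Bid],
    expected_fleets: Set[str],
    timeout_s: float,
    close_early: bool,
) -> Tuple[float, Optional[str]]:
    """Return (time the auction closes, winning fleet or None).

    close_early=False models the strict window: always wait timeout_s.
    close_early=True models the "loose" timeout: close as soon as every
    fleet in expected_fleets has bid, otherwise close at timeout_s.
    """
    received: Dict[str, float] = {}
    close_time = timeout_s
    for arrival_s, fleet, cost in sorted(bids):
        if arrival_s > timeout_s:
            break  # bids arriving after the window are ignored
        received[fleet] = cost
        if close_early and expected_fleets.issubset(received):
            close_time = arrival_s
            break
    # The lowest cost wins; no bids in time means the task is unassigned.
    winner = min(received, key=received.get) if received else None
    return close_time, winner
```

With bids arriving at 1 s and 2 s and a 600 s timeout, the strict rule closes at 600 s while the loose rule closes at 2 s, with the same winner. With a 10 s window and bids arriving at 30 s (Scenario 1), both rules close at 10 s with no winner.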

I am still learning about how traffic deconfliction, task bidding systems, etc. work, so I would greatly appreciate it if you could help me better understand the design decisions taken by your team in building this system, along with your thoughts on my idea.

Thank you.


Edited by @dillonloh at 2024-10-18T09:27:08Z

Posted by @Yadunund:

Hi @dillonloh,

Thanks for sharing your positive experience with Open-RMF!

You raise a valid concern about the bidding_time_window parameter being almost a “magic number” that is hard to get right, especially over the course of the application’s lifetime.

The Dispatcher has no awareness of the names or number of fleets running in an RMF deployment since we have a decentralized system. Hence, it currently does not have a mechanism to check if all fleets have responded to the bid notice.

There are a few ways to approach this:

  1. We subscribe to /fleet_states and keep track of which fleets are available and have responded. But this has some problems:
  • Fleet adapters do not publish over /fleet_states unless enabled in their configs.
  • A discovered fleet may not be a Full Control type, so it’s not capable of responding to task bids.
  • A previously discovered fleet might have all its robots decommissioned, so it won’t respond to any new bids.
  2. We subscribe to /rmf_traffic/participants published by the Traffic Scheduler node to get the list of active robots/fleets publishing schedule changes.
  • But we still suffer from the same limitations as above.

To make either of these approaches work, we’d need to encode information about the capability of a fleet (full control, traffic light, etc.) into the above messages and update fleet adapters to report this. But exposing this information is not great imo, since users can use the rmf_fleet_adapter APIs to mix and match capabilities and write fleet adapters that do not fit our category buckets. We could make the task bidding capability something modular which Full Control includes by default, and then have this capability advertise that the fleet can respond to bids. But this is a big change that might be made redundant by the Workflows we’re working on.
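As a rough illustration of what advertising a bidding capability could enable, here is a sketch only. `FleetInfo`, its `can_bid` and `active_robots` fields, and both helper functions are hypothetical; nothing like this exists in the current messages.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class FleetInfo:
    """Hypothetical summary of what a fleet would advertise."""
    name: str
    can_bid: bool       # assumed capability flag, not in today's messages
    active_robots: int  # robots currently in service


def expected_bidders(fleets: Iterable[FleetInfo]) -> Set[str]:
    """Fleets the Dispatcher could reasonably wait for."""
    return {f.name for f in fleets if f.can_bid and f.active_robots > 0}


def all_bids_in(fleets: Iterable[FleetInfo], responded: Set[str]) -> bool:
    """True once every expected bidder has responded.

    Returns False when no fleet is expected to bid, so the caller falls
    back to the full bidding window instead of closing immediately.
    """
    expected = expected_bidders(fleets)
    return bool(expected) and expected.issubset(responded)
```

Without the capability flag, a traffic light fleet or a fully decommissioned fleet would stay in the expected set forever, and the auction would always run for the full window anyway.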

Some other workarounds that come to mind are:

  • Add a bid timeout parameter to the dispatch_task_request schema and use this value in the Dispatcher node as the timeout.
  • Update the Dispatcher implementation to make bidding_time_window a dynamic parameter which can be updated over the lifetime of the node via ROS 2 parameter service calls.
    (or maybe we can do both).
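A sketch of how the two workarounds could combine, written in plain Python rather than against the ROS 2 or Dispatcher APIs. `BidWindowPolicy` and the `bid_timeout_s` request field are hypothetical; `set_default` stands in for whatever a parameter update would trigger.

```python
from typing import Any, Mapping


class BidWindowPolicy:
    """Pick the bidding window for each task request (illustrative only)."""

    def __init__(self, default_window_s: float) -> None:
        self._default_window_s = self._validated(default_window_s)

    @staticmethod
    def _validated(value: Any) -> float:
        window_s = float(value)
        if window_s <= 0.0:
            raise ValueError("bid window must be positive")
        return window_s

    def set_default(self, window_s: float) -> None:
        # Stand-in for a runtime update of bidding_time_window.
        self._default_window_s = self._validated(window_s)

    def window_for(self, request: Mapping[str, Any]) -> float:
        # "bid_timeout_s" is an assumed field name, not part of the schema.
        override = request.get("bid_timeout_s")
        if override is None:
            return self._default_window_s
        return self._validated(override)
```

A request that carries its own timeout uses it; every other request uses the current default, which an operator could raise as task queues grow.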

Open to other suggestions as well!