Service Load-Balancing

Hi ROS community,

We would like to open discussion and feedback for service load-balancing.

The Purpose

A ROS 2 service server must have enough resources to respond to all incoming requests, which can be difficult on edge devices. For the same service name, the current implementation allows multiple service clients but only one service server. If multiple service servers use the same service name, each of them receives every request from a service client and responds individually, which results in incorrect behavior.

So we want to support multiple service servers on the same service path to achieve redundancy and load balancing.

The Rough Design

Existing service client and service server applications can support this without code changes; they only need to remap the service path via command-line arguments at startup. For a service client, the service path is remapped like --ros-args -r add_two_ints:=add_two_ints/load_balancer. For service servers, the service path is remapped like --ros-args -r add_two_ints:=add_two_ints/load_balancer/server-1 or --ros-args -r add_two_ints:=add_two_ints/load_balancer/server-2.

A new load-balancing service node consists of:

  • A service server proxy.
    It accepts connections from external service clients.
    The service server proxy (e.g. add_two_ints/load_balancer) starts together with the node.

  • A load-balancing policy.
    It selects a service client proxy according to the configured load-balancing strategy.

  • A service client proxy.
    When a new backend service server is detected (e.g. add_two_ints/load_balancer/server-3), a new service client proxy is started to connect to the newly discovered backend service server.
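The discovery step for client proxies can be illustrated with a minimal plain-Python sketch. This is not rclpy code; the names `ClientProxy` and `ProxyRegistry` are hypothetical, and in a real node the discovered service list would come from the ROS graph API rather than a hard-coded list:

```python
# Hypothetical sketch: create one client proxy per discovered backend server.
BASE = "add_two_ints/load_balancer"

class ClientProxy:
    """Stands in for a real service client connected to one backend server."""
    def __init__(self, service_name):
        self.service_name = service_name

class ProxyRegistry:
    def __init__(self, base_path):
        self.base_path = base_path
        self.proxies = {}  # backend service name -> ClientProxy

    def update(self, discovered_services):
        """Create a client proxy for every newly seen backend server."""
        for name in discovered_services:
            # Backends register under <base>/server-N.
            if name.startswith(self.base_path + "/") and name not in self.proxies:
                self.proxies[name] = ClientProxy(name)

registry = ProxyRegistry(BASE)
registry.update([BASE + "/server-1", BASE + "/server-2", "unrelated/service"])
registry.update([BASE + "/server-3"])  # server-3 discovered later
```

Running `update` again with an already-known backend is a no-op, so the registry can be refreshed on every graph change event.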


Basic execution process

  • Request callback in Service Server Proxy

    • Get the writer GUID, sequence_number, and serialized request data (pointer), and put them into Request_Receive_Queue. (The writer GUID and sequence_number can be obtained from rmw_request_id_t.)
    • Notify that Request_Receive_Queue has changed.
  • Load-balance thread

    • Wait for a change of Request_Receive_Queue.
    • Get the writer GUID, sequence_number, and serialized request data (pointer) from Request_Receive_Queue.
    • According to the specified load-balancing policy, choose which Service Client Proxy to use.
    • Use the selected Service Client Proxy to send the serialized request data and get a new sequence_number.
    • Save the correspondence to the table. (When the Service Client Proxy receives a response, it refers to this table to determine which service client to send the result to.)
      writer GUID and sequence_number ↔ Service Client Proxy and sequence_number
    • Remove the writer GUID, sequence_number, and serialized request data from Request_Receive_Queue.
  • Response callback in Service Client Proxy

    • Get the Service Client Proxy (pointer), sequence_number, and serialized response data (pointer), and put them into Response_Receive_Queue.
    • Notify that Response_Receive_Queue has changed.
  • Forward response thread

    • Wait for a change of Response_Receive_Queue.
    • Get the Service Client Proxy (pointer) and sequence_number, and query the table for the corresponding writer GUID and sequence_number. Remove this entry from the table.
    • The Service Server Proxy sends the serialized response data with the writer GUID and sequence_number.
    • Remove the Service Client Proxy (pointer), sequence_number, and serialized response data from Response_Receive_Queue.
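The correspondence table at the heart of this process can be sketched in plain Python. The class and method names here are hypothetical; a real implementation would also need locking, since the load-balance thread and the forward-response thread access the table concurrently:

```python
# Hypothetical sketch of the correspondence table:
# (writer GUID, client sequence_number) <-> (Service Client Proxy, proxy sequence_number)

class CorrespondenceTable:
    def __init__(self):
        # (proxy_id, proxy_seq) -> (writer_guid, client_seq)
        self._table = {}

    def record(self, writer_guid, client_seq, proxy_id, proxy_seq):
        """Called by the load-balance thread after forwarding a request."""
        self._table[(proxy_id, proxy_seq)] = (writer_guid, client_seq)

    def resolve(self, proxy_id, proxy_seq):
        """Called by the forward-response thread; removes the entry once used."""
        return self._table.pop((proxy_id, proxy_seq))

table = CorrespondenceTable()
# A request from client (GUID "A", seq 7) is forwarded via proxy 2 as its seq 1:
table.record("A", 7, proxy_id=2, proxy_seq=1)
# The response arrives on proxy 2 with seq 1 -> reply to client "A" using seq 7:
writer_guid, client_seq = table.resolve(2, 1)
```

Popping the entry on resolve keeps the table bounded: each in-flight request occupies exactly one slot until its response is forwarded.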

Load balancing algorithm

  • Round Robin
    Assign requests to service servers sequentially, in request order.

  • Balance the number of requests
    Send new requests to the service server with the fewest currently active requests.

  • Balance response time
    Send new requests to the service server with the shortest average response time.
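The three policies above can be sketched in plain Python as interchangeable selectors over a list of backends. The class names are hypothetical, and the bookkeeping (in-flight counts, response times) would be fed by the load-balance and forward-response threads in a real node:

```python
import itertools

class RoundRobin:
    """Assign backends sequentially, in request order."""
    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def select(self):
        return next(self._cycle)

class FewestActive:
    """Pick the backend with the fewest in-flight requests."""
    def __init__(self, backends):
        self.active = {b: 0 for b in backends}  # in-flight request count

    def select(self):
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1  # request dispatched
        return backend

    def done(self, backend):
        self.active[backend] -= 1  # response received

class FastestAverage:
    """Pick the backend with the shortest average response time."""
    def __init__(self, backends):
        self.stats = {b: (0.0, 0) for b in backends}  # (total seconds, count)

    def select(self):
        def avg(b):
            total, n = self.stats[b]
            return total / n if n else 0.0
        return min(self.stats, key=avg)

    def observe(self, backend, seconds):
        total, n = self.stats[backend]
        self.stats[backend] = (total + seconds, n + 1)

rr = RoundRobin(["server-1", "server-2"])
picks = [rr.select() for _ in range(4)]  # alternates between the two backends
```

Since all three expose the same select() interface, the load-balance thread can treat the policy as a pluggable strategy chosen at startup.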


I would greatly appreciate any feedback or suggestions.
Please feel free to share your thoughts.


Can you give an example of the kind of service you think might need this? I fail to see it. Robots are not usually meant to do any heavy lifting for outside entities, their resources tend to be stretched enough by their own needs.

You mention edge devices; is this perhaps meant for edge computing, i.e., a server that provides services to the robots? If so then I don’t understand why you want to use ROS for that, given, as you noticed, it’s not a good fit for this. Why not just use TCP/HTTP directly together with an existing load balancer like HAProxy?

Thank you for your question.

I currently don’t have any real examples. For edge computing, HAProxy is a good solution.
I’m considering a ROS-only setup. Currently, all computation in the robot is provided through ROS services. As development progresses, we might find that heavy computation makes a device in the robot respond slowly. At that point, we could add new devices that provide the same services, without code changes.

The idea mentioned above has already been implemented, with some minor adjustments, of course.

You’re welcome to try it out and give your feedback.