How do you connect your robots to Kubernetes (K8S) clusters?

Hi all!

I am looking for a general solution to the problem of connecting a robot to ROS nodes running inside a k8s cluster. It’s a scenario that I believe will become more and more common in practice, however I could not find any guidelines or best practices anywhere.

Basically the core of the problem is that ROS nodes use TCP/UDP/DDS to communicate with each other on random ports, while k8s wants to expose services on configurable ports and typically over HTTP(S). Ingress controllers have to be configured to route packets from the robot (outside the cluster) to the “right” destination pod. This doesn’t really play well with ROS.

I have looked around here on discourse and played with a few implementations I found online, but I would like to know if I am missing something obvious or if somebody has come up with other solutions.

As of now, I am aware of the following approaches:

  1. Using a VPN (e.g., wireguard, or maybe Husarnet? e.g., Connecting Remote ROS 2 Nodes using Docker & VPN) to create an overlay between the pods subnet/namespace and the robot(s);

  2. Extending the k8s cluster through federation, basically adding a robot as a “remote cluster” (e.g., Robotics Distributed System based on Kubernetes) and using cluster “internal” networking with the appropriate CNI plugin;

  3. Using rosbridge (GitHub - RobotWebTools/rosbridge_suite: Server Implementations of the rosbridge v2 Protocol) in WebSocket mode, exposing it through a k8s ingress, and running a client on the robot (e.g., rosduct: GitHub - uts-magic-lab/rosduct: Proxy to expose remote ROS topics, services and parameters locally thru rosbridge);

  4. Using message brokers (e.g., RabbitMQ) on the k8s cluster and “cloud bridges” to relay ROS messages across nodes connected to different ROS masters (e.g., ROS Routed Networks :: rapyuta.io Documentation );

  5. Using ROS only within robots / cloud and transforming ROS messages to another representation on the internet;
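
For approach 3, the robot-side client only has to speak JSON over a WebSocket, which is what makes it ingress-friendly. A minimal sketch of the rosbridge v2 operations such a client would send, assuming a hypothetical ingress URL and made-up topic names; the WebSocket transport itself is left out:

```python
import json

# Hypothetical ingress endpoint (assumption); a real client would open a
# WebSocket connection to it and send the strings built below.
ROSBRIDGE_URL = "wss://rosbridge.example.com/"


def subscribe_msg(topic, msg_type, throttle_rate_ms=0):
    """Build a rosbridge v2 'subscribe' operation as a JSON string."""
    return json.dumps({
        "op": "subscribe",
        "topic": topic,
        "type": msg_type,
        # Minimum time in ms between messages the server sends back.
        "throttle_rate": throttle_rate_ms,
    })


def publish_msg(topic, msg):
    """Build a rosbridge v2 'publish' operation as a JSON string."""
    return json.dumps({"op": "publish", "topic": topic, "msg": msg})


# Illustrative topics only.
sub = subscribe_msg("/cmd_vel", "geometry_msgs/Twist", throttle_rate_ms=100)
pub = publish_msg("/battery_level", {"data": 0.87})
```

Everything here is plain text on a single port, so a standard HTTP(S) ingress rule is enough to route it.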

Are you aware of, or (even better) do you use, any other solution in production?
Thank you in advance.

All the best,

Gio

Personally I think using ROS to communicate with the cloud, where I assume your kubernetes cluster runs, is a bad idea. ROS communication wasn’t designed for that. For instance, sending the exact same message 100 times a second is fine in ROS, but you absolutely wouldn’t want to do that over an uplink to the cloud, even if it’s wifi. So your Approach 5. is the way to go if you ask me, and a key element in that transformation is to reduce messages to just diffs (perhaps with “keyframes”, i.e., complete messages every once in a while) and throttling.
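
To make the diff/keyframe/throttle idea concrete, here is a minimal sketch. It assumes messages have already been flattened into dicts with a fixed set of keys; the class name and packet layout are hypothetical, not taken from any existing project:

```python
class DiffThrottler:
    """Reduce a message stream to throttled diffs with periodic keyframes."""

    def __init__(self, min_interval=1.0, keyframe_every=10):
        self.min_interval = min_interval      # seconds between uplink sends
        self.keyframe_every = keyframe_every  # every Nth sent packet is a full message
        self.last_sent = None                 # last state known to the cloud
        self.last_time = None
        self.sends = 0

    def process(self, msg, now):
        """Return a packet to send for `msg` (a flat dict), or None to drop it."""
        if self.last_time is not None and now - self.last_time < self.min_interval:
            return None  # throttled: too soon after the previous send
        if self.last_sent is None or self.sends % self.keyframe_every == 0:
            packet = {"kind": "keyframe", "data": dict(msg)}
        else:
            # Only changed or new keys are sent; removed keys are not handled.
            changed = {k: v for k, v in msg.items() if self.last_sent.get(k) != v}
            if not changed:
                return None  # identical to what the cloud already has
            packet = {"kind": "diff", "data": changed}
        self.last_sent = dict(msg)
        self.last_time = now
        self.sends += 1
        return packet


t = DiffThrottler(min_interval=1.0, keyframe_every=10)
first = t.process({"x": 1.0, "y": 2.0}, now=0.0)    # first message: keyframe
dropped = t.process({"x": 1.5, "y": 2.0}, now=0.5)  # within 1 s: throttled
diff = t.process({"x": 1.5, "y": 2.0}, now=1.0)     # only "x" changed
```

The receiving side applies each diff on top of the last keyframe, so a lost diff is repaired at the latest by the next keyframe.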

I’m actually working on a new open-source project that solves this and related problems of cloud robotics. Let me know if you want to chat more about this 1:1.

Hello Christian Fritz,
Please let me know your email so we can discuss cloud robotics. My email: kasi.ceo@gmail.com

Giovanni,

I’ve run production systems involving processing of data sets from field robots in the cloud (low-latency inference systems) and have also always gone with Approach 5.

I think there are a few reasons to consider this:

  • ROS is not obviously designed to work in a Cloud Native fashion. You can get it to work with K8s, but it wouldn’t work in a general way with other Cloud technologies. I am mainly thinking of lambdas or functions here.
  • The kinds of people you would typically hire to run things in the Cloud won’t necessarily be familiar with ROS and might not want to use it, or at least would want to limit the areas where ROS is present.
    I myself have always worked in systems where ROS terminated at the Edge, either in the field robot or immediately on ingress into the cloud via some sort of proxy nodes. Typically this happened via a translation layer where ROS messages were converted into JSON/Protobuf etc. and handed over to a more Cloud-oriented messaging technology, typically some pub/sub system.
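
As a rough illustration of such a translation layer, the sketch below wraps a ROS message that has already been converted to a plain dict into a JSON envelope for a generic pub/sub system. The envelope fields, function names, and sample values are assumptions made for this example, not a description of any particular production system:

```python
import json


def to_cloud_envelope(robot_id, topic, msg, stamp):
    """Wrap a ROS message (already converted to a dict) for a cloud pub/sub system."""
    envelope = {
        "robot_id": robot_id,
        "topic": topic,
        "stamp": stamp,   # seconds since epoch, e.g. from the message header
        "payload": msg,
    }
    # Sorted keys and compact separators keep the encoding deterministic.
    return json.dumps(envelope, sort_keys=True, separators=(",", ":")).encode("utf-8")


def from_cloud_envelope(data):
    """Decode an envelope on the cloud side; no ROS installation is needed here."""
    return json.loads(data.decode("utf-8"))


# Illustrative values only.
wire = to_cloud_envelope("robot-7", "/odom", {"x": 1.2, "y": -0.4}, 1700000000.5)
decoded = from_cloud_envelope(wire)
```

The point of the design is that everything downstream of the proxy node sees ordinary bytes and JSON, so cloud engineers never have to touch ROS.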

A mixed case would be running K8s on-premise in a multi-robot scenario, where you do have sufficient bandwidth for ROS messages. My current work is in this situation and we have opted for HashiCorp’s Nomad over a full K8s distribution. We are mainly interested in a container launch solution rather than a full-fledged workflow orchestration engine. Our solution for networking is, essentially, to not use it and run on the host network.

Happy to provide more details.

Hey there,

been there also many times…

Even just subnetworks can cause the same pain that you are feeling. I’m describing a setup where you have a mobile robot consisting of several parts (IPC, PLC, scanners, HMI, drives, switch, …). This mobile robot is not alone on the (industrial) network. There may exist multiple instances of the same robot, therefore it is good practice to give each robot its own subnetwork so that it faces the common field-level network with only one IP address.

Also, in my experience only approach 5 really works well for production use cases.
However, I think this only holds true for ROS1.

If you are able to upgrade to ROS2, I find this situation a bit more solvable.
We are working a lot with docker compose setups (Siemens Industrial EDGE marketplace) that have strict privilege settings for the industrial setting. So docker run --network host <container> is not working in our setting.
What we found to be working quite well is the Gateway feature from RTI, a DDS vendor. Just open a few deterministic ports (docker / k8s / …) and you are ready to go.
Also, we happen to have good experiences with SoSS - now called eProsima Integration Service. But with SoSS, we have only experience in the context of connecting ROS2(DDS) with MQTT, ROS1 or other ROS2 domains. I’m pretty confident however that this tool could help you overcome your network issues in some way as well.
FastDDS also seems to have a WAN solution based on the Integration service that might come in handy.
In context of converting from ROS1 towards ROS2, the widespread ros2/ros1-bridge tool often had some kind of performance issues in our test scenarios with partners.

Depending on your setup, you have to understand what you are trying to do with the cloud and why.
Is it high-level convenience ROS service/action calls to conduct your fleet? Do you want to back up (all) the data for some kind of passive AI detection? Or do you want to have an open control loop (send laser data, receive localization/navigation data/instructions)?
With DDS you have QoS settings that can either confuse you, as RViz tends to have problems displaying topics with different QoS settings, or benefit you (faster, mitigating unstable connections), since losing some data (e.g., a camera stream) during transmission might be fine for your use case (QoS: best-effort setting).

Hope this helps a bit.
Cheers! :slight_smile:

Hi,

You could also take a look at Zenoh (https://zenoh.io/, https://zenoh.io/docs/getting-started/key-concepts/). At a higher level, the fog05 project (https://fog05.io/) uses Zenoh to orchestrate a network of nodes, and this includes nodes running on K8s (https://fog05.io/docs/going-deeper/architecture/).

There is also ROS2 integration. https://zenoh.io/blog/2021-04-28-ros2-integration/

Cheers!

Sojan

Hi Christian,

thank you for your answer, and thank you also for your blog post on a “full stack platform for robotic applications”. I have been pushing for the same idea for a few years now, with the same argument and the same line of reasoning, even competing for EC funding for it. I used the form to get in touch with you.

I also agree that in a “remote” k8s cluster scenario (even on wifi) one will want to minimize uplink data transmission and/or control it on demand as it’s done in dynamic monitoring. Diffs, “key frames”, possibly a mechanism that even filters messages if changes are below a certain threshold (e.g., 0.0001 on joint positions) are all worth investigating. I’ll be glad to chat about this too.
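
The threshold idea could look roughly like the deadband filter below. It is only a sketch: the class name is hypothetical, joint positions are assumed to arrive as a list of floats, and 0.0001 is simply the example value from above:

```python
class DeadbandFilter:
    """Forward joint positions only when some joint moved more than `threshold`."""

    def __init__(self, threshold=1e-4):
        self.threshold = threshold
        self.last_forwarded = None

    def should_forward(self, positions):
        # Always forward the first message, or one with a different joint count.
        if self.last_forwarded is None or len(positions) != len(self.last_forwarded):
            self.last_forwarded = list(positions)
            return True
        moved = any(
            abs(new - old) > self.threshold
            for new, old in zip(positions, self.last_forwarded)
        )
        if moved:
            # Compare against the last forwarded state, not the last seen one,
            # so slow drift still accumulates and eventually gets through.
            self.last_forwarded = list(positions)
        return moved


f = DeadbandFilter(threshold=1e-4)
results = [
    f.should_forward([0.0, 1.0]),      # first message is always forwarded
    f.should_forward([0.00005, 1.0]),  # below threshold: filtered
    f.should_forward([0.00009, 1.0]),  # still below threshold: filtered
    f.should_forward([0.00015, 1.0]),  # accumulated change exceeds threshold
]
```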

I see that cloud bridge development is advancing at great speed, unfortunately I am not involved (my own fault), but I wonder if the “robot side” of the bridge shouldn’t be part of the same project and use the ideas you mention above. Maybe it’s already on the timeline, but I am not aware of it (again, my fault).

As for my general question above, I also have a teaching scenario at ZHAW where I have local robots and a local k8s cluster (albeit on 2 isolated subnets). In that case, also for teaching purposes, I need the students to be able to “see” all whitelisted traffic flowing between robots and k8s pods.

Still, ros bridge + rosduct imply them starting, configuring, and managing additional components not relevant to the class, while my school’s IT admins don’t like the VPN option as it breaks their isolation model. That’s why I was looking for other options I might have missed.

Hi Barret,

thank you for your reply.
Apart from “format on the wire” between robots and “cloud”, I think you are bringing up an even more fundamental question when you mention “cloud native” / “FaaS”.

I see basically 2 different sets of components:

  • Stateless (but db-backed): e.g. common “monitor and save” scenarios of info coming from the robot, digital twins based on this (stored) data;
  • Stateful (hence hardly cloud-native): e.g., the move_it suite we use to control robots in the labs and that runs in cloud/k8s containers.

The former can be easily implemented in public cloud with cloud-native principles as they are meant to serve many robots at the same time.

I wonder whether services like the latter are even meant to be “cloud native” at all. After all, in my understanding, they are ideally deployed per robot instance on-prem (at the edge) and using containerization more for latency and resource sharing reasons.
They might still benefit from k8s HA control-loops, but are not living under the high-churn and failure rate that you could expect in a public cloud infrastructure. So somehow “edge cluster” conditions are different than “public cloud”, hence your choice to go with a much simpler container orchestration model rather than full-fledged k8s.

Thank you!

Thank you Flo, this helps a lot!

I will go through all the references you posted, maybe also a chance to have the students work with ros1 and 2 at the same time.

I also have IT admins imposing subnet isolation and banning VPNs, so it will have to be jumping through hoops…

Is this the RTI Gateway you mentioned? GitHub - rticommunity/rticonnextdds-gateway: The RTI Gateway is a software component that allows integrating different types of connectivity protocols with DDS. Integration in this context means that data flows from different protocols are adapted to interface with DDS, establishing communication from and to DDS.

Cheers

Thank you Sojan,

I looked at the links you posted, and if I understood correctly what you suggest amounts to using one Zenoh router, then 2 or more Zenoh/DDS bridges on each side. That’s akin to solution 4 I mentioned above.

Most of the robots our students use in class are still based on ROS1: is there a Zenoh-ROS1 bridge?

I imagine having ROS1->ROS2 bridge and then DDS->Zenoh and back is overkill

while my school’s IT admins don’t like the VPN option as it breaks their isolation model. That’s why I was looking for other options I might have missed.

I also have a teaching scenario at ZHAW where I have local robots and a local k8s cluster (albeit on 2 isolated subnets). In that case, also for teaching purposes, I need the students to be able to “see” all whitelisted traffic flowing between robots and k8s pods.

thanks for sharing the use case. as discussed in this thread, there are many ways to do this, but i guess it depends on use case and requirement.

how about the following in this case?

conditions/assumptions:

  • ROS master is running in k8s pod with CNI implementation in LAN-A.

  • some other ROS nodes are running in k8s pod with CNI implementation in LAN-A.

  • students use network LAN-B only.

  • students can use docker container?

procedure/setup:

  • use WeaveNet as CNI in k8s cluster in LAN-A.

  • issue weave launch/expose from the students’ node in LAN-B to a specific weave router in LAN-A.

  • bind the weave CNI to the container that runs the students’ ROS nodes to access the ROS master.

i might be mistaken and am not sure this is acceptable to your IT admin, but that is something i would do…it should work even with ROS 2.

thanks

Thank you Tomoya.

That’s an excellent idea!
I’ll talk to the cluster admins and see if we can use a different CNI plugin

Hi gtoff,

Looking at your use case I’d say that eProsima DDS Router is the way to go. The DDS Router is a non-graphical user software application that allows you to easily connect DDS networks over WAN communication. The DDS Router source code is open source and you can download it from its GitHub repository.

In your specific case, it would connect a local DDS network with ROS 2 nodes publishing on and subscribed to some topics with another ROS 2 application deployed in a Kubernetes cluster in the cloud. Deploying a DDS Router on your local network and another on the Kubernetes cluster would create a communication channel between the two so that all messages published on your local network would be re-transmitted to the cloud and vice-versa.

As @chfritz says, forwarding all messages can be highly demanding in terms of network traffic.
That’s why the DDS router can be configured to filter the topics that will be forwarded so that not all the data published in the local network is sent to the cloud.
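
The filtering idea can be sketched independently of any particular tool. The snippet below is not the DDS Router's configuration format; it is a hypothetical allowlist/blocklist check using shell-style patterns, with made-up topic names, meant only to show the kind of decision made per topic:

```python
from fnmatch import fnmatchcase

# Hypothetical patterns (assumptions for illustration only).
ALLOWLIST = ["/robot1/odom", "/robot1/camera/*/compressed"]
BLOCKLIST = ["*/debug/*"]


def is_forwarded(topic, allowlist=ALLOWLIST, blocklist=BLOCKLIST):
    """A topic is forwarded if it matches an allow pattern and no block pattern."""
    if any(fnmatchcase(topic, pattern) for pattern in blocklist):
        return False
    return any(fnmatchcase(topic, pattern) for pattern in allowlist)


topics = [
    "/robot1/odom",
    "/robot1/camera/front/compressed",
    "/robot1/scan",
    "/robot1/camera/debug/compressed",
]
forwarded = [t for t in topics if is_forwarded(t)]
```

Here only the odometry and the front camera topics would cross the WAN link; the laser scan stays local because it is not on the allowlist, and the debug stream is blocked explicitly.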

Also seeing the growing use of Kubernetes as a framework to deploy end user device orchestration systems, we have documented how to deploy ROS 2 nodes on Kubernetes and to connect them to local ROS 2 robots using the DDS Router. Please take a look at this example as I think that it fits your use case.

Finally, the new DDS Router is intended to be the WAN solution for DDS since it is designed for this specific case, leaving eProsima Integration Service as a protocol translator.

Thank you Raúl.
I’ll try out the example you shared.

Most of my robots are still on ROS1, but I guess I could use the ROS1->2 bridge and really reduce the topics and rates of publication

Just sharing the information for those who are interested in this use case.

We had a chance to talk about this ROS with Kubernetes topic at KubeCon EU 2021 Edge Day.

please feel free to reach out to me with questions and comments :smiley:

Thank you for sharing. The attached presentation is very insightful.

I have been experiencing some issues with using the ROS2-ROS1 bridge on foxy and zenoh.
Is there a solution to this issue? There are no topics discovered on the ros1 end even after starting the dynamic bridge with a correct ROS_DOMAIN_ID.

You’ll probably have better luck posting these kinds of questions in their own thread on ROS Answers :slight_smile:

NP! I got it working. Thanks
