
ROS 2 on Kubernetes

Hello all,

I just wanted to share a blog post titled ROS 2 on Kubernetes: a simple talker and listener setup. This is the second in a series on running ROS 2 on Kubernetes with MicroK8s. Posts here on Discourse and elsewhere document mixed results with ROS on K8s, and the main challenge seems to be getting RTPS traffic to flow consistently.

The first post unpacked some of the RTPS features that led to seemingly unpredictable behavior on K8s. Everything in that post should be consistent with middleware vendor documentation and actual traffic captured from a k8s cluster.

This post builds on the first to show how to network K8s pods. It uses MicroK8s with Multus to add a MacVLAN adapter for ROS to use. The config is just meant as a starting point; hopefully the discussion helps you create an implementation that works for you.

I’m putting together a third post that shows how to extend the cluster out to multiple machines, and a fourth that goes into more detail on troubleshooting different configurations.

Hope this helps you understand a bit more about running ROS 2 on K8s!

9 Likes

@SidFaber

thanks for the post :+1:
I am interested in ROS/ROS 2 & Kubernetes too. I created the following thread earlier,

and this one is a presentation from a Kubernetes Meetup in Tokyo that gives an overview of our approach.

If possible, could we discuss the requirements and feasibility of MicroK8s for our use cases? That would be really appreciated :smile:

@tomoyafujita, I spent lots of time studying that post & your examples, thanks for sharing! When trying to reproduce this with MicroK8s, I needed to use a different CNI to get multicast & discovery traffic to work. I don’t believe it’s possible to reliably run two different ROS 2 containers in the same pod, but it is possible to reliably run multiple pods on the same k8s node with the right CNI setup. The next challenge is deploying pods to hardware based on hardware attributes.

Any chance you’ll be at ROSWorld? Maybe we can catch up there, it’d be great to hear more about your use cases!

1 Like

We are one of the Gold Sponsors for ROS World 2020. I will be in touch with you :smile:

1 Like

@SidFaber, I’m not intending to resurrect a dead thread here, but I happened to find your article when doing some research for a project at work, and your writeup helped us weigh the pros and cons of ROS on Kubernetes.

Based on the issues with multicast you discussed, it seems like this probably isn’t the right fit for us now, but then we found this thread about the new FastDDS discovery server. Admittedly I’m pretty out of my depth on the networking stuff here, but it seems like the server is a pretty direct solution to the multicast issues.

Seeing as the thread is more recent than some of your articles, I was wondering if this is something you’d already seen & tested, or if it’s new and possibly helpful for this particular use case?

2 Likes

Hi @garrettwenger ,

Curiously, a few days ago I was presenting this architecture:

This is how I think everything fits together. For sure you can use the Discovery Server to simplify the discovery process in the cloud.

Hi @garrettwenger, glad you found the posts useful! In the blog series I intentionally stayed middleware-agnostic and addressed generic RTPS issues that impact ROS 2 on k8s. Each middleware implementation gives you much finer control over networking; those options just aren’t exposed within ROS. Hopefully the blog posts (the fourth one in particular) provide you with some tools to tune the middleware to suit your implementation. I haven’t done that customization myself (yet!).

As you continue exploring ROS + K8s, remember that multicast is only one of three behaviors to consider:

  • Multicast traffic: RTPS discovery is UDP multicast by default, and CNIs handle multicast differently. Rather than changing CNIs, you may instead be able to configure unicast discovery behavior for your middleware like @Jaime_Martin_Losa describes.
  • NAT’d traffic: RTPS discovery doesn’t survive NAT’ing (host IP/Port details are embedded in the discovery locator per the RTPS spec). Even with unicast discovery, still avoid anything that works through port translation.
  • Localhost traffic: all containers within a pod share the same localhost interface, but they don’t automatically negotiate ephemeral ports. If you don’t manage port assignments, two containers in the same pod will typically use the same port…but k8s will only allow one to succeed. This gets real confusing real quick.
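On that first point, here’s a minimal sketch of what switching to a discovery server might look like with Fast DDS as the ROS 2 middleware. The server address, port, and exact CLI flags are assumptions; check the Fast DDS documentation for your version:

```shell
# Start a Fast DDS discovery server on one cluster-reachable host
# (10.0.0.1:11811 is a placeholder address/port, server ID 0).
fastdds discovery -i 0 -l 10.0.0.1 -p 11811 &

# In each ROS 2 container, point discovery at the server instead of
# relying on UDP multicast (supported with rmw_fastrtps from Foxy on):
export ROS_DISCOVERY_SERVER="10.0.0.1:11811"
ros2 run demo_nodes_cpp talker
```

With this in place, participants send discovery traffic as unicast to the server, so multicast support in the CNI stops being a requirement.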

My theory is that it’s best to explicitly manage the network interfaces used by ROS inside a pod (a middleware-specific config): disable ROS on all default interfaces (including localhost), and add an overlay network interface (maybe vxlan?) just for ROS traffic within the cluster. That should provide good traffic segmentation and maybe even better performance / throughput.
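As a rough, untested sketch of that idea (interface name, VNI, and addresses are all placeholders; a CNI plugin like Multus would normally manage this), an overlay interface could be added per node and the middleware pinned to it. The CYCLONEDDS_URI line shows one middleware-specific way to select an interface; other vendors have equivalent XML options:

```shell
# Create a VXLAN overlay interface just for ROS traffic
# (id, group, and subnet below are placeholder values).
sudo ip link add ros0 type vxlan id 42 dev eth0 dstport 4789 group 239.1.1.1
sudo ip addr add 10.42.0.1/24 dev ros0
sudo ip link set ros0 up

# Pin CycloneDDS to that interface so ROS ignores the default NICs:
export CYCLONEDDS_URI='<CycloneDDS><Domain><General><NetworkInterfaceAddress>ros0</NetworkInterfaceAddress></General></Domain></CycloneDDS>'
```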

ROS 2 on Kubernetes isn’t for everything, but I do believe it has some perfect niche use cases. It’d be great to hear more about what others are working on, and any middleware-specific configs that make for a stable implementation.

Nice! @sgermanserrano of Linaro 96boards tested the CycloneDDS ROS middleware on Kubernetes with good results.

1 Like

Hi @SidFaber

You can always configure a list of unicast peers, but it is more convenient to have just one or several discovery servers than to maintain such a list. Also, a Kubernetes deployment could potentially be a large-scale system with many nodes, and in that case the discovery server will minimize the discovery traffic.

@SidFaber

thank you for posting your thoughts and theory :+1:

CNIs handle multicast differently.

True, CNI is just an interface; everything depends on the implementation.

If you don’t manage port assignments, two containers in the same pod will typically use the same port…but k8s will only allow one to succeed.

I wasn’t clear on this. I think port assignment is not something k8s does but the kernel: containers in the pod share the virtual NIC requested by the kubelet (the k8s agent), and the rest is handled by the system (kernel). And if port assignment is the problem, it isn’t specific to ROS but affects cloud services too. Maybe I’m missing something in what you’re describing here; could you elaborate a little bit?

thanks in advance :exclamation:

1 Like

i think port assignment is not something k8s does but kernel. containers in the pod share the virtual NIC in the system requested by kubelet(k8s agent), then the rest is taken care by system(kernel).

Agreed, let me explain this a bit better as I understand it. Let’s assume we have two talker containers T1 and T2 in the same pod. All networking is managed by the pod, and both containers share the pod’s network namespace. Assuming T1 starts up before T2, it grabs the pod’s port 7400 on all the pod’s interfaces–including loopback–for multicast discovery. It also grabs port 7411 for unicast traffic–talking on a topic.* Shortly afterwards T2 starts up, and being an exact copy of T1, it attempts to use the same ports in the same network namespace. So what happens?

In my experience, T2 neither sends nor receives network packets on port 7400 since the port is already in use. ROS logs show T2 publishing data, but network traffic captured at T2, at T1, and at the K8s host shows no datagrams from T2. The Kubernetes documentation simply recommends against two containers using the same port:

… containers within a Pod can all reach each other’s ports on localhost. This also means that containers within a Pod must coordinate port usage.

Kubernetes recommends port management (using different ports for T1 and T2) for containers within a pod. For example, that would mean running discovery on port 7401 in T2. But in DDS-speak, that’s akin to putting T2 in a different domain, and it simply doesn’t work with ROS.

An alternative Kubernetes solution is to add a k8s service to the pod; the service can then use IPC to communicate with the containers. However, k8s services depend on load balancing / address translation, which doesn’t work with RTPS either (at least in my experience).

So my recommendation is to stick with the Kubernetes design principle of running only a single process per pod.

Hope this helps; please post if you’ve experienced things a bit differently. I’m continuing to explore Kubernetes and have been working mostly with MicroK8s.

*This assumes domainID=0; all port assignments actually depend on domain ID, participant ID and other parameters–see the RTPS spec, para 9.6.1
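For reference, the well-known port mapping from para 9.6.1 can be computed directly. Here’s a little sketch using the default port parameters (PB=7400, DG=250, PG=2, d0=0, d1=10, d3=11) that shows why two identical participants collide:

```shell
# Compute RTPS well-known ports per the DDSI-RTPS 9.6.1 defaults.
rtps_ports () {
  local domain=$1 participant=$2
  local PB=7400 DG=250 PG=2 d0=0 d1=10 d3=11
  echo "discovery multicast: $((PB + DG*domain + d0))"
  echo "metatraffic unicast: $((PB + DG*domain + d1 + PG*participant))"
  echo "user unicast:        $((PB + DG*domain + d3 + PG*participant))"
}

rtps_ports 0 0   # domain 0, participant 0 -> 7400 / 7410 / 7411
rtps_ports 0 1   # a distinct participant ID shifts the unicast ports
```

So if T2 were assigned a distinct participant ID instead of also using 0, its unicast ports would not clash with T1’s; only the (intentionally shared) multicast discovery port stays the same.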

@Jaime_Martin_Losa, my guess is that it’d be pretty easy to set up a discovery server as a DaemonSet and have one (and only one) running per k8s node. That could be really handy for keeping most discovery traffic on localhost, and making RTPS discovery behave just like DNS does within k8s. PM me if you want to chat more.
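A sketch of what that DaemonSet might look like; the image name, args, and port are hypothetical placeholders (I’m not aware of an official discovery-server image), so treat this as a shape rather than a working deployment:

```shell
# Hypothetical: run one Fast DDS discovery server per k8s node.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fastdds-discovery
spec:
  selector:
    matchLabels: {app: fastdds-discovery}
  template:
    metadata:
      labels: {app: fastdds-discovery}
    spec:
      hostNetwork: true                           # listen on the node itself
      containers:
      - name: discovery
        image: example/fastdds-discovery:latest   # placeholder image
        args: ["fastdds", "discovery", "-i", "0", "-p", "11811"]
EOF
# Pods on each node would then set ROS_DISCOVERY_SERVER to their node's IP.
```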

1 Like

@SidFaber

really appreciate your clarification :+1: thanks.

I see what you’re saying here now.

it grabs the pod’s port 7400 on all the pod’s interfaces–including loopback–for multicast discovery. It also grabs port 7411 for unicast traffic–talking on a topic.* Shortly afterwards T2 starts up, and being an exact copy of T1, it attempts to use the same ports in the same network namespace. So what happens?

I believe we already detected this before, and it has been fixed. It has something to do with participant identification and port allocation as specified by DDSI-RTPS 9.6.1.1, Discovery traffic. (I believe the question is whether the container runtime is the node described by DDSI-RTPS.)

We are really getting into a specific problem, so I will check this out with your configuration just in case, and if it does not work I will create an issue and include you.

thanks for bringing this up :smiley:

2 Likes

Adding another lesson learned to this thread (I spent lots of time working through this one): make sure all the nodes in your cluster are uniquely named and that they can all find each other. If they are unable to find each other, the “join” command will hang or fail and you’ll likely see “not found”-type errors in /var/log/syslog.

It’s easy to lay down a stock OS image without changing the hostname, but MicroK8s wants each host to be uniquely named. Use hostnamectl set-hostname to give each node a unique name.

Moreover, since each node in the cluster needs to find all the other nodes, the host names need to be resolvable in DNS. One easy way to do this is with /etc/hosts: here’s an example of host micro2 that’s part of a three-node cluster:

ubuntu@micro2:~$ cat /etc/hosts
127.0.0.1 localhost
127.0.0.1 micro2

192.168.1.22 micro2
192.168.1.23 micro3
192.168.1.24 micro4

Or, if you’re resolving through DNS, make sure all these names are in your DNS zone…and I also had to set up a default DNS search suffix so nodes could find each other without an FQDN.
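Here’s a quick way to sanity-check resolution from each host before attempting the join (node names are just the examples above):

```shell
# Verify each cluster node name resolves on this host, via
# /etc/hosts or DNS, before running "microk8s join".
check_nodes () {
  local status=0
  for n in "$@"; do
    if getent hosts "$n" > /dev/null; then
      echo "$n: ok"
    else
      echo "$n: NOT resolvable"
      status=1
    fi
  done
  return $status
}

check_nodes micro2 micro3 micro4 || echo "fix /etc/hosts or DNS first"
```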

1 Like

Just FYI, we confirmed that we still had the problem described above, and it’s been addressed with https://github.com/eProsima/Fast-DDS/pull/1637. We’ve confirmed that this fixes the problem with Kubernetes (talker and listener containers in the same pod can communicate).

thanks for the heads-up :+1:

1 Like