Managing robot-level code base

Hi,

I would like to start a conversation about managing and containerizing ROS-based code bases. I'll start by sharing a possible solution (or two) here, but it would be nice to see how other people deal with the same problems.

My team is developing the software for a ROS2-powered robot. The code base consists of different parts that we call ‘modules’. Modules were introduced to help separate responsibilities. For example, one such module is called ‘navigation’, which runs the nav2 stack and a few other things. Another might deal with hardware such as lidars and other sensors.

In total we have about 12 modules. So far, we have run each module in a separate Docker image, with a docker-compose file to bring all of them up at once. However, we are starting to face a lot of dependency problems that are getting harder to solve. For example, we have a forked version of the nav2 stack as a git submodule in the navigation module. The problem is, we need that same fork in another module too. This is an issue because the modules are, of course, separate Docker build contexts, and git doesn’t let us cleanly share a single submodule checkout between two of them.
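
For readers unfamiliar with this layout, the per-module setup described above might look roughly like the following compose file (service and image names here are made up for illustration):

```yaml
# docker-compose.yml -- one container per module, each with its own build context
services:
  navigation:
    build: ./navigation        # own context, own Dockerfile, own submodules
    image: robot/navigation
  sensors:
    build: ./sensors
    image: robot/sensors
  # ... roughly ten more module services in the real setup
```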

A solution for this might be to create a new Docker image for the nav2 stack and use it as a base image for the other Dockerfiles. However, this would mean that whenever the base image changes, the build cache of all depending images is invalidated.
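
Sketched as Dockerfiles, that base-image solution could look like this (the distro and all names are placeholders, and the cache problem is visible: any change to the base layers forces a rebuild of everything built `FROM` it):

```dockerfile
# nav2-base/Dockerfile -- builds the forked nav2 stack exactly once
FROM ros:humble
COPY nav2_fork/ /opt/nav2_ws/src/nav2_fork/
RUN cd /opt/nav2_ws && . /opt/ros/humble/setup.sh && colcon build

# navigation/Dockerfile -- every module that needs the fork starts from the base
FROM robot/nav2-base:latest
COPY src/ /opt/nav_ws/src/
RUN cd /opt/nav_ws && . /opt/nav2_ws/install/setup.sh && colcon build
```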

Another solution is to get rid of the whole module concept and use one Dockerfile for the stack. This solves the above dependency problem, but hurts the ‘separation of responsibilities’ principle, and also makes logs a lot harder to read and filter.

To summarize, here is the list of pros and cons for using a single Dockerfile:

Pros:

  • virtually no dependency problems
  • a single Dockerfile is a lot easier to maintain than 12
  • saves a lot of overhead by not having to install packages (e.g. ROS or pip) in multiple images

Cons:

  • makes reading logs harder
  • hurts ‘separation of responsibilities’ principle
  • harder to develop and debug (for example, you can’t rebuild and restart just the module you’re currently working on, which would save time)
  • it would not be possible anymore to run different versions of different modules together (which might be a bad practice anyway)
3 Likes

This is what we use on the robot (edge compute). The time to create the single container is not significant as part of CI/CD. We can efficiently deploy tests to infrastructure for SIM and RESIM on HIL and SIL using the container from a registry, then deploy to the robots. It’s not exactly a single container, since we have x86 and aarch64 versions. Using these two containers is simpler for traceability and security reviews, and lets us focus our energy on optimizing two containers rather than a composition of containers. We may at some point add additional containers, but if we do, it will be for features orthogonal to what runs in the robot-application container.
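
One way to keep an x86 and an aarch64 image in lockstep from a single definition is a multi-platform build file, e.g. for `docker buildx bake` (a sketch; the registry name is made up):

```hcl
# docker-bake.hcl -- one build definition, two architectures
target "robot" {
  dockerfile = "Dockerfile"
  platforms  = ["linux/amd64", "linux/arm64"]
  tags       = ["registry.example.com/robot:latest"]
}
```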

For cloud / edge compute services that require high-availability, we pay the cost to break down the system into containerized micro-services on K8s, as we are working to complete run-time deployment of updates without stopping the cloud / edge services.

In addition we use this approach for features we are not actively developing and depend on from others. Specifically our base OS and platform dependencies (i.e. Jetson, CUDA, TensorRT), come from other teams so we use their container as a base for our ROS2 container.
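
Concretely, layering on another team's platform image might look like the following (the base image name and ROS distro are illustrative, not the actual setup):

```dockerfile
# Dockerfile -- our ROS 2 layer on top of the platform team's Jetson base
FROM platform-team/jetson-cuda-tensorrt:r35.4   # hypothetical base image
RUN apt-get update \
    && apt-get install -y ros-humble-ros-base \
    && rm -rf /var/lib/apt/lists/*
COPY ros_ws/ /opt/ros_ws/
RUN cd /opt/ros_ws && . /opt/ros/humble/setup.sh && colcon build
```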

This approach doesn’t seem useful for your forked version of Nav2 if it’s under active development, but could benefit other packages where you are not actively making changes, creating a common base for each of the modules.

This is an example from one company to inform your conversation on managing containerized ROS-based code bases, but not the only way.

Thanks.

1 Like

You run into similar issues when there are multiple computers inside a robot, which is something that can happen very quickly. During my PhD I worked a lot with the Care-O-bot code base, which had up to 4 machines inside the robot. They had shared storage for the workspace and logging (you can pipe logs to just one machine/container using ROS). Maybe this approach is something to consider.

On a side note, what are you hoping to achieve with such modularity? In what way do you expect it to work better than having everything on one machine or in one container?

2 Likes

What do you mean by this? As I understand it, you have a shared dependency that you want to use in two Docker images, is that correct?

Some years ago I found the structure of the Duckiebot Docker projects interesting. It’s close to your idea of modularity with Docker.

Just to give you some perspective, here are some of my thoughts on the usage of Docker. I think you first need to clarify the main purpose of this strategy: development or production deployment.

In my experience Docker has been really good with making sure that we ship an immutable image:

  • All ROS dependencies are installed and pinned. This is a big source of pain in production, since the buildfarm packages are not strictly tied to one another and it’s easy to end up with incompatible package versions.
  • The build is always executed from a clean state, so there is no risk of developers’ personal secrets & configurations leaking into the build.
  • Easy to automate. In order to enable fast & reproducible builds, we tended to do a big build of all of our packages together. This ensures that they all built with the same version of dependencies, and made writing integration tests a breeze. Then we used Docker multi-staged builds to distribute the binaries to the correct images.
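
The "build everything together, then distribute binaries via multi-stage builds" pattern described above could be sketched like this (stage and image names are placeholders):

```dockerfile
# Stage 1: build all packages together against the same pinned dependencies
FROM ros:humble AS builder
COPY src/ /ws/src/
RUN cd /ws && . /opt/ros/humble/setup.sh && colcon build --merge-install

# Later stages: slim runtime images that only copy in the built artifacts
FROM ros:humble-ros-core AS navigation
COPY --from=builder /ws/install/ /opt/robot/

FROM ros:humble-ros-core AS sensors
COPY --from=builder /ws/install/ /opt/robot/
```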

However, I have found this strategy a little bit difficult to use for development:

  • Docker images are in general immutable, so a lot of the changes you make in the container tend to be short-lived, and this entails a lot of context switching between inside & outside the container. This also causes some issues with running as root inside the container vs as a user outside.
  • No desktop processes without messing around with X.org security
  • Creating slim Docker images is antithetical to having a good debugging experience

My experience has mostly been with creating optimized images for production, and in those cases we tended to prefer reliability and reproducibility (in general freezing everything to a known version) over the versatility and flexibility you seem to be after.

1 Like

I think Docker CAN be useful for development as well, especially with ROS. By sharing the containers’ network, you can simply connect a GUI application like RViz on your host machine to the processes running in the container(s). We never had any problems with this locally; remote access is another question, but I guess that’s not essential for purely development purposes.
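
A minimal sketch of that setup, assuming ROS 2's default DDS discovery: run the containers on the host network so nodes are visible to an RViz started outside Docker (image name and domain ID are made up):

```yaml
# docker-compose.yml -- host networking so host-side tools can see the nodes
services:
  navigation:
    image: robot/navigation     # hypothetical image name
    network_mode: host          # share the host's network stack for DDS discovery
    environment:
      - ROS_DOMAIN_ID=42        # must match the shell where RViz is launched
```

RViz is then started normally on the host with the same `ROS_DOMAIN_ID` exported.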

We develop outside docker so the immutability of the images is no problem either. Currently our workflow looks more or less like this:

  1. Write code on host machine
  2. Build docker image(s)
  3. Push to the robot’s machine (of course if in simulation, skip this step)
  4. Run & test

Testing on the robot’s machine is a form of “deployment”, which is why Docker is useful for development too. Without Docker this would be a lot more complicated, especially considering that several people might be working on the robot (not at the same time, but switching fairly often), using different versions, etc.

This one is true, but you can also use two separate Dockerfiles for prod & dev, or better yet, base the prod Dockerfile on the dev image and clean it up.
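
The “base prod on dev and clean it up” idea might be sketched like this (the dev image name and the exact packages to strip are assumptions, not a real recipe):

```dockerfile
# Dockerfile.prod -- start from the full dev image and strip dev-only tooling
FROM robot/dev:latest AS prod        # hypothetical development image
RUN apt-get purge -y gdb valgrind ros-humble-rviz2 \
    && apt-get autoremove -y \
    && rm -rf /var/lib/apt/lists/* /ws/build /ws/log
```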

P.S.
I don’t mean to shoot down all your points, just here to generate discussion.

1 Like

As I understand it, they use multiple levels of inheritance between the Dockerfiles. This wouldn’t really work if you frequently modify base images, as you would then have to rebuild all the dependent ones from scratch.

1 Like

I remember a similar problem with dependencies at my last job. We had one large Dockerfile with multiple stages: 1) bare ROS2, 2) development, 3) production.

To handle continuously changing packages during development, we assumed that the “development” image would always bind-mount the project workspace. It was a decent way to handle all the requirements of the robots in the laboratory.
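
Such a bind-mounted development setup could look roughly like this (paths and names are illustrative):

```yaml
# docker-compose.dev.yml -- development stage with the workspace mounted in
services:
  dev:
    image: robot/dev                # hypothetical "development" stage image
    network_mode: host
    volumes:
      - ./ros_ws:/home/dev/ros_ws   # edit on the host, build inside the container
    command: sleep infinity         # keep the container up; exec in to build/run
```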

For the “production” image we fetched our own .deb packages from a local repository (built on each package’s release). However, we were not limited to a single container: there was always the possibility of separating the running nodes into (multiple) dedicated containers, just like your modules @redvinaa .
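
A production image along those lines, fetching pre-built .debs from an internal apt repository, might look like this (the repository URL and package names are made up for the sketch):

```dockerfile
FROM ros:humble-ros-core
# register the internal repository that hosts our released .deb packages
RUN echo "deb [trusted=yes] http://apt.internal.example.com/ stable main" \
      > /etc/apt/sources.list.d/internal.list \
    && apt-get update \
    && apt-get install -y robot-navigation robot-sensors \
    && rm -rf /var/lib/apt/lists/*
```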

Of course, this approach wasn’t the optimal way to handle all the “module containers”. But at least it allowed us to maintain just one image, which could run any part of the whole project.

Thinking about your case with the forked nav2, I wonder if you have already considered moving the installation (or even adding a reinstallation) of your packages to the end of the Dockerfile.
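
That layer-ordering idea, sketched (distro and package names are placeholders): put the stable dependencies in early layers and the frequently changing fork in the last ones, so an edit to the fork only invalidates the final layers.

```dockerfile
FROM ros:humble
# 1. rarely-changing system/ROS dependencies -- cached across rebuilds
RUN apt-get update \
    && apt-get install -y ros-humble-nav2-bringup \
    && rm -rf /var/lib/apt/lists/*
# 2. the actively developed fork goes last, so a change here
#    only rebuilds these final layers
COPY nav2_fork/ /ws/src/nav2_fork/
RUN cd /ws && . /opt/ros/humble/setup.sh && colcon build
```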