Read-through caching container registry

Docker Hub rate limits have been in effect since 2020-11-02T08:00:00Z and have been causing periodic issues on build.ros.org and build.ros2.org.

I am completing the deployment of a caching registry mirror for build.ros.org and build.ros2.org following this recipe. Notably, my first attempt to deploy this on each build farm’s repo host failed to resolve the rate limiting issue: the docker-registry version in Ubuntu Xenial appears to pass data through without correctly caching the actual image files. Our production mirror is running on a Focal host using the docker-registry package from Ubuntu without issue. I expect that using the official registry container image from Docker would also work, but I haven’t tested it.
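
For context, the core piece of that recipe is the pull-through cache (`proxy`) section of the registry configuration. The sketch below is an illustration under a few assumptions (the config path and storage directory of a typical Ubuntu docker-registry install, and port 5000), not a copy of our production config:

```yaml
# /etc/docker/registry/config.yml (typical path for the Ubuntu docker-registry package)
version: 0.1
storage:
  filesystem:
    rootdirectory: /var/lib/docker-registry   # where cached manifests and layers are stored
http:
  addr: :5000                                 # address the mirror listens on
proxy:
  # Run as a read-through cache of Docker Hub: anything not already cached
  # is fetched from this upstream and stored locally.
  remoteurl: https://registry-1.docker.io
```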

Pointing each host at the registry mirror requires modifying its Docker daemon configuration. There is a draft PR here: https://github.com/ros-infrastructure/buildfarm_deployment/pull/244 to make this a configurable option in buildfarm_deployment_config.
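
For anyone configuring a host by hand in the meantime, the daemon-side change is just a `registry-mirrors` entry in `/etc/docker/daemon.json`, followed by a restart of the Docker daemon. The hostname and port below are placeholders:

```json
{
  "registry-mirrors": ["https://registry-mirror.example.org:5000"]
}
```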

A caching container registry mirror is going to be added to the new 20.04-based build farm deployment and will be enabled by default.

Does the read-through cache really help avoid running into the rate limit?
The limit applies to docker manifest pulls, not the actual image layer pulls…

As far as I understand the read-through caching registry, it will still pull the manifests to check if the image is still up-to-date and then cache the actual layers.

At least I could not find any option to control how often the manifests are read from the main Docker registry, which is what actually counts towards the rate limit.

Do you have any more information on how this is done in the registry?

The docker-registry doesn’t log its interactions with the upstream registry, but I expect that it’s doing HEAD requests rather than GET requests to check whether the manifests have changed, and per the documentation HEAD requests are not rate limited.

I wasn’t confident that this would be enough, but it withstood an attempt to force the rate limit by making 500 sequential docker pulls in a loop, and the hosts we’ve configured to use it have not hit the problem since. I won’t claim it’s guaranteed, but it has been working for us through the last 18 hours of heavy activity. (There are still a few failures on the ARM hosts of build.ros2.org, which use a custom deployment pipeline and haven’t been updated yet.)
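
For reference, the rate limit test was nothing more elaborate than a loop of sequential pulls; something along these lines (the image name here is just an example, not necessarily the one actually used):

```bash
#!/usr/bin/env bash
# Try to trip Docker Hub's rate limit by pulling through the mirror 500 times in a row.
for i in $(seq 1 500); do
    docker pull ubuntu:focal
done
```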