SSL failure in buildfarm jobs (ROS1)

Hi, I’ve been seeing strange, consistent failures of the armhf job for my rosfmt package for a few days now:
https://build.ros.org/job/Nbin_ufhf_uFhf__rosfmt__ubuntu_focal_armhf__binary/11/console

It seems it’s failing to download the rosdistro index (https://raw.githubusercontent.com/ros/rosdistro/master/index-v4.yaml) because it cannot verify the SSL certificate. I think it’s because the Docker container does not have the appropriate root certificates installed, but I’m not sure. Does anyone have a similar issue or know what’s going on?
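
If you want to check whether the CA store inside one of these containers is intact, a quick probe might look like the following (a sketch, not from the build logs; the paths are the standard Debian/Ubuntu ones):

dpkg -s ca-certificates | grep Status   # is the package installed at all?
ls /etc/ssl/certs/*.0 | wc -l           # hash symlinks OpenSSL actually follows during lookup
echo "" | openssl s_client -connect raw.githubusercontent.com:443 2>&1 | grep -i verif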

The strange thing is that it affects only the armhf build…

Perhaps related: osrf/multiarch-docker-image-generation#38?


Edit: probably not; that particular armhf build appears to run on an amd64 agent.

Also seeing the same failure on a few recent releases.

@cottsay investigated this issue last week and identified that, for as-yet-undetermined reasons, openssl rehash is not doing its job in our containers, but that running c_rehash is enough to kick things into shape.
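
For context (standard OpenSSL behaviour, not specific to this thread): both openssl rehash and c_rehash rebuild the hash-named symlinks in /etc/ssl/certs that OpenSSL follows to locate a CA certificate by subject hash, so a broken rehash leaves verification unable to find the root even though the PEM files themselves are present. The two implementations can be run side by side for comparison:

openssl rehash /etc/ssl/certs   # the built-in C implementation (the one misbehaving here)
c_rehash /etc/ssl/certs         # the older Perl script shipped with openssl, which works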

His minimal reproduction, which I’ve been using to test, involves installing openssl and ca-certificates and running this test command:

echo "" | openssl s_client -connect raw.githubusercontent.com:443 | grep "Verification: OK" || echo "FAILED"
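
On a healthy system this prints “Verification: OK” in the session information block; when the CA lookup is broken, s_client typically reports “Verification error: unable to get local issuer certificate” instead and the grep falls through to FAILED. Dropping the grep for a broader pattern shows the verify error details:

echo "" | openssl s_client -connect raw.githubusercontent.com:443 2>&1 | grep -i -E "verif|error"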

QEMU makes tracing the rehash process in the “guest” container impossible, and I wasn’t able to reproduce the problem using our image or the stock ubuntu:focal image on native hardware. I was, however, able to reproduce the issue using the “stock” cross-platform image with qemu injected:

docker run -ti -v /usr/bin/qemu-arm-static:/usr/bin/qemu-arm-static --platform armhf ubuntu:focal bash
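
Inside that container, the whole cycle from failing test to workaround looks roughly like this (a sketch of the reproduction described above, using only stock Ubuntu packages):

apt-get update && apt-get install -y openssl ca-certificates
echo "" | openssl s_client -connect raw.githubusercontent.com:443 | grep "Verification: OK" || echo "FAILED"   # prints FAILED
c_rehash /etc/ssl/certs   # rebuild the hash symlinks with the Perl script
echo "" | openssl s_client -connect raw.githubusercontent.com:443 | grep "Verification: OK" || echo "FAILED"   # now prints Verification: OK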

As a temporary workaround we added this patch, which runs c_rehash in the armhf containers, but it now looks like we’re hitting new errors during the actual builds that weren’t there before. I’ll dig into those tomorrow, with the goal of getting us back to building, and then circle back to figuring out exactly what is happening.

There aren’t Debian armhf builds on the farm, but I can see this problem with Debian buster and bullseye containers as well.

To summarize day two: we haven’t identified the exact problem. When running /usr/bin/qemu-arm-static -strace /usr/bin/openssl rehash to see the interpreted system calls, the problem does not manifest, making this a textbook heisenbug. Interestingly, invoking the interpreter explicitly without -strace (/usr/bin/qemu-arm-static /usr/bin/openssl rehash) also doesn’t exhibit the issue.
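
Short of tracing, one way to check whether a bad run actually produced wrong links is to compare what openssl rehash wrote against the hash OpenSSL expects for a given root (a sketch; the certificate file name is illustrative):

openssl x509 -hash -noout -in /etc/ssl/certs/SomeRoot.pem   # prints the subject hash the symlink must be named after
ls -l /etc/ssl/certs | grep SomeRoot                        # is there a matching XXXXXXXX.0 symlink pointing at it?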

The issue doesn’t present on an 18.04 host. I didn’t have 18.10, 19.04, or 19.10 images to test with, but it does seem to be an interaction between the host kernel/libc, qemu, and the container userspace.
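
To narrow down which combination is at fault, it helps to record all three moving parts, host-side and guest-side (a sketch; the qemu path matches the bind mount above):

uname -r                             # host kernel, run on the host
/usr/bin/qemu-arm-static --version   # qemu user-mode emulator version
dpkg-query -W libc6                  # guest libc, run inside the container
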
Once we’d gotten that far, we switched to trying a second workaround. With the 20.04 deployment, one of the things we enabled was native ARM agents. When testing them we found an issue with the bundled default seccomp policy for Ubuntu 20.04, which also reminded me to include the personality system call used by sbcl. That change has been added to our native ARM machine and will soon be part of the cookbook. With it we were able to get a successful build of Nbin_ufhf_uFhf__rosfmt__ubuntu_focal_armhf__binary #21, and when ros-infrastructure/ros_buildfarm_config#195 (“Fix label configuration for noetic armhf”) is deployed we’ll be using the native ARM agents for all our Noetic armhf builds.
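
For anyone hitting the same seccomp wall: running once with seccomp disabled confirms the diagnosis, and a tweaked profile is the durable fix. The sketch below shows the general approach, not the exact change we deployed; the profile URL is Docker’s stock profile and the edit described is illustrative:

docker run --security-opt seccomp=unconfined ...   # diagnosis only, not for production
curl -fsSLo profile.json https://raw.githubusercontent.com/moby/moby/master/profiles/seccomp/default.json
# edit profile.json: the syscalls entry whose "names" includes "personality" filters the
# allowed argument values; add the value sbcl needs (ADDR_NO_RANDOMIZE) or relax the filter
docker run --security-opt seccomp=profile.json ...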

I’m going to write up what we’ve got on the qemu-discuss list tomorrow, hoping for advice, or for confirmation that we’ve found a bug to report formally. If the trail goes cold there, we may not have the availability to keep investigating the issue as it manifests under QEMU, as long as the native agents are working for us.
