Thanks! Things are still a work in progress. Some issues are ones I did not hit in our development or staging setups; they only showed up when trying to keep a busy production service running.
Thanks for the feedback. As we finish the tail end of this migration and prepare for the build.ros.org migration I’ll try to provide a bit more in the status updates, if only to provide better breadcrumbs for myself and the community to expand upon in a full review.
My plan for the path ahead is to get the changes made on our production branch reviewed and merged into the latest branch in preparation for a tagged release of the cookbook, get the chef workflow that we have been iterating on internally publicly documented, and then write up a migration guide like the one I wrote around the time of our Ubuntu Xenial migration, "ROS Buildfarm October 2017 Guide to new changes".
Far and away the largest challenges have been caused by the desire to preserve the artifacts that were created prior to the migration. If you’re willing to scrap your old data, setting up a new buildfarm is close to straightforward, except that the chef workflow is new and different from the previous buildfarm deployment workflow and still requires documentation.
There were some outright bugs or missing features in the new config that only turned up when we pulled in all the production data and saw “that doesn’t look right”.
There have also been a few unforeseen stability issues. An import package job failed due to a timeout waiting for the gpg agent to start up, and although the error was recoverable, I can’t recall it happening previously. I also saw about 50% of the fleet’s docker daemons crash last night within several minutes of each other, and I do not know why. The issue which caused last night’s shutdown is that the chef resource for creating Jenkins credentials is not idempotent and causes Jenkins to try, unsuccessfully, to find an older version of the same credential. I’ve worked around that for now, but it still needs a better solution.
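For anyone unfamiliar with what "idempotent" means for a resource like this, here is a minimal sketch of the general pattern in Python. It is not the cookbook's code and not the Jenkins API: `Credential`, `CredentialStore`, and `ensure` are hypothetical names, and the store is just an in-memory dictionary. The idea is that converging on a desired state twice should write once and then do nothing.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Credential:
    # Hypothetical fields for illustration; a real Jenkins credential carries more metadata.
    credential_id: str
    username: str
    secret: str


@dataclass
class CredentialStore:
    # In-memory stand-in for a credential store (an assumption, not the Jenkins API).
    entries: dict = field(default_factory=dict)
    writes: int = 0

    def ensure(self, desired: Credential) -> bool:
        """Converge on the desired credential; return True only if a write happened."""
        current = self.entries.get(desired.credential_id)
        if current == desired:
            return False  # already in the desired state, so leave it untouched
        self.entries[desired.credential_id] = desired
        self.writes += 1
        return True


store = CredentialStore()
cred = Credential("buildfarm-ssh", "jenkins-agent", "not-a-real-secret")
first = store.ensure(cred)   # first run creates the credential
second = store.ensure(cred)  # second run is a no-op
print(first, second, store.writes)
```

A resource that skips the comparison and rewrites the credential on every run is the kind of non-idempotent behavior described above, since each run replaces an entry that other parts of the system may still be referring to.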