June was a month of behind-the-scenes improvements for the OSRF Infrastructure project committee, but it was a full one nonetheless.
We’re also welcoming a new project committer: @marcogg is joining us primarily to resume the work on ROS packaging infrastructure for Rust-based packages*. Welcome Marco!
* I realize this is a bit of a teaser! Stay tuned for the next Infrastructure Community Meeting announcement.
Without further ado, here are the June updates:
Increasing reliability of CI agents on ci.ros2.org and build.ros2.org
Outside of those contributing to the ROS 2 core, most people in the community don’t interact with ROS 2 CI or the CI jobs on build.ros2.org. These services allow us to test tightly coupled changes across a set of ROS 2 core repositories in much the same way that a developer would build and test ROS 2 locally.
We’ve been having mounting difficulty maintaining our EC2 spot fleet of Jenkins agents over the last several years due to changes in the capacity and volatility of the spot instance markets in our regions.
@Crola1702 has been working on changes to allow us to run agents across multiple availability zones, which should help us maintain capacity for these builds.
Separate from the capacity problems, we’ve also had some build environment integrity issues. We recently learned that our attempted configuration for phased updates in Ubuntu was not being honored when running apt install. Rather than trying to opt out of phased updates, we now enable them unconditionally to make sure that apt upgrade operations consider the same set of packages as apt install. The change is in Pull Request #709 on ros2/ci, "Switch to always opting into phased updates." by nuclearsandwich.
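For readers who want to try a similar opt-in on their own Ubuntu hosts, here is a minimal sketch that renders an apt configuration drop-in from Python. The option name APT::Get::Always-Include-Phased-Updates is an existing apt setting; the file name 99-phased-updates and the helper functions are hypothetical, and this is not necessarily how the pull request above implements the change.

```python
from pathlib import Path
import tempfile

# Existing apt option; everything else here (file name, helpers) is hypothetical.
APT_OPTION = "APT::Get::Always-Include-Phased-Updates"


def render_phased_updates_conf(always_include: bool = True) -> str:
    """Return the text of an apt.conf.d drop-in for phased updates."""
    value = "true" if always_include else "false"
    return f'{APT_OPTION} "{value}";\n'


def write_phased_updates_conf(conf_dir: Path, name: str = "99-phased-updates") -> Path:
    """Write the opt-in drop-in into conf_dir (normally /etc/apt/apt.conf.d)."""
    target = conf_dir / name
    target.write_text(render_phased_updates_conf(True))
    return target


if __name__ == "__main__":
    # Write to a temporary directory so the sketch is safe to run anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        path = write_phased_updates_conf(Path(tmp))
        print(path.read_text(), end="")
```

With a drop-in like this in place, both apt install and apt upgrade should evaluate phased packages the same way, which is the consistency the change is after.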
Investigating performance and stability issues on ci.ros2.org and build.ros2.org
Odd failures in builds on ci.ros2.org had been occurring with increased frequency. In the past, these issues were related to misbehavior of dockerd when memory availability was low, and that’s the first place our team checked, even going so far as to deploy double-sized (in both CPU core count and memory) instances into the ci.ros2.org agent fleet to see if the increased memory reduced the problem.
Unsurprisingly, it did. Surprisingly, not for the reason we anticipated! Thanks to the careful sleuthing of @claraberendsen, we learned that the actual cause of the issue was unattended-upgrades running on our agents concurrently with our nightly build schedule and restarting dockerd in the middle of a build, causing the interruption. The larger instances, in addition to having more memory, also had an increased CPU core count, and the improved build performance reduced the overlap between the nightly build and the unattended-upgrades window.
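To see why faster builds masked the problem, it can help to compute the overlap between the two time windows. The sketch below uses made-up dates, start times, and build durations (the real schedule isn't given here); only the interval arithmetic is the point.

```python
from datetime import datetime, timedelta


def overlap(start_a, end_a, start_b, end_b) -> timedelta:
    """Length of the intersection of [start_a, end_a) and [start_b, end_b)."""
    latest_start = max(start_a, start_b)
    earliest_end = min(end_a, end_b)
    return max(earliest_end - latest_start, timedelta(0))


# Hypothetical schedule: values chosen only for illustration.
NIGHTLY_START = datetime(2024, 6, 1, 4, 0)
UPGRADES_START = datetime(2024, 6, 1, 6, 0)
UPGRADES_END = datetime(2024, 6, 1, 7, 0)


def exposure(build_duration: timedelta) -> timedelta:
    """Time during which a nightly build could be hit by a dockerd restart."""
    return overlap(NIGHTLY_START, NIGHTLY_START + build_duration,
                   UPGRADES_START, UPGRADES_END)


if __name__ == "__main__":
    # A slower build on a standard-sized agent versus a faster one on a
    # double-sized agent (both durations are invented).
    print(exposure(timedelta(hours=3)))
    print(exposure(timedelta(hours=2, minutes=15)))
```

A shorter build shrinks the exposure window without removing the underlying conflict, which matches what we saw: fewer failures on the larger instances, but for a different reason than extra memory.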
Work to resolve this is still under way, but along the way @claraberendsen dug deep into the resources used during a build and discovered that Linux builds on ci.ros2.org are heavily CPU bound, which is going to help our efforts to optimize build performance and cost efficiency over the coming months.
@jrivero and I spent some time last month reviewing hosting resources for the build farms and were able to clean up a number of dangling resources that were no longer needed, resulting in significant cost savings on OSRF’s infrastructure budget. We’re using what we learned from this ad-hoc cleanup to revise our provisioning tools so that they clean up automatically as we go, reducing this overhead in the future.
While we were performing this manual cleanup, we got a little over-zealous and de-provisioned some images which were still being used in production. This prevented new agents in some auto-scaling groups from back-filling after spot capacity rebalances, resulting in fewer agents in some pools. Luckily, other members of the Infrastructure team spotted this and were able to quickly build new images using our internal infrastructure.
Another large savings for us is coming from the in-progress migration of the EBS volumes, which provide the logical disk storage for our hosts in AWS, from gp2 to the next-generation gp3. Not only does gp3 have a lower cost per GB-month, but the IOPS allocation is no longer tied to volume size. Our gp2 volumes were generally over-provisioned on storage in order to maximize our IOPS allocation and avoid disk IO bottlenecks during builds. With gp3, the maximum default IOPS allocation from gp2 is now the baseline regardless of volume size. We were able to reduce the storage provisioned for some classes of build farm agents from 1000GB down to 250GB or 350GB without a penalty on IOPS. Together with the savings from the reduced cost per GB-month, we were easily able to budget extra allocation to cover the reduction in baseline throughput from 250MiB/s to 125MiB/s between gp2 and gp3 and still net a significant cost reduction.
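The arithmetic behind that trade-off can be sketched in a few lines. The per-unit prices below are assumptions for illustration only (AWS pricing varies by region and changes over time), and the helper names are hypothetical; the 3 IOPS per GB scaling for gp2 and the 3000 IOPS and 125MiB/s baselines for gp3 follow the published EBS volume characteristics.

```python
# Assumed list prices in USD; actual AWS pricing varies by region and over time.
GP2_PER_GB_MONTH = 0.10
GP3_PER_GB_MONTH = 0.08
GP3_PER_EXTRA_MIBPS_MONTH = 0.04

GP3_BASELINE_IOPS = 3000
GP3_BASELINE_MIBPS = 125


def gp2_baseline_iops(size_gb: int) -> int:
    """gp2 baseline scales at 3 IOPS per GB, clamped between 100 and 16000."""
    return min(max(3 * size_gb, 100), 16000)


def gp2_monthly_cost(size_gb: int) -> float:
    """Storage-only cost for a gp2 volume."""
    return size_gb * GP2_PER_GB_MONTH


def gp3_monthly_cost(size_gb: int, throughput_mibps: int = GP3_BASELINE_MIBPS) -> float:
    """Storage cost plus any throughput provisioned above the gp3 baseline."""
    extra = max(throughput_mibps - GP3_BASELINE_MIBPS, 0)
    return size_gb * GP3_PER_GB_MONTH + extra * GP3_PER_EXTRA_MIBPS_MONTH


if __name__ == "__main__":
    before = gp2_monthly_cost(1000)
    after = gp3_monthly_cost(250, throughput_mibps=250)
    print(f"gp2 1000GB: {gp2_baseline_iops(1000)} IOPS, ${before:.2f}/month")
    print(f"gp3  250GB: {GP3_BASELINE_IOPS} IOPS, ${after:.2f}/month")
```

Under these assumed prices, a 250GB gp3 volume with throughput raised back to 250MiB/s keeps the same IOPS as a 1000GB gp2 volume at a fraction of the monthly cost.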
Since we separately learned that our current builds are CPU-bound rather than IO-bound, we aren’t looking at further IOPS or throughput increases on our volumes right now, but we have more headroom for increases later on if we need them.
That’s all for now. Happy Friday!