
Buildfarm down

According to status.ros.org, build.ros.org has been down for quite a while now (getting 503s).

Fans or tubes clogged?

It looks like, although it went down over the weekend, none of our probes triggered to actually put it in an outage state, so I didn’t get any notifications for it.

I brought Jenkins back up and it went down again a few minutes later… Normally I check the dmesg log for out-of-memory issues, as the Jenkins JVM lasts 2-3 months before it needs a restart. But today I was getting segfaults in aufs. The Docker version on the build farm master hadn’t been updated since its initial deployment, so I brought it up to the latest stable version of docker-ce. We’ll see if that resolves the issues or if we need to bring Docker down completely and move to the overlay2 storage driver.
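
For anyone who wants to do the same kind of dmesg triage, here is a minimal sketch of sorting kernel log lines into out-of-memory and segfault buckets. It is not the script used on the build farm; the function name `classify_kernel_log` is hypothetical, and the two patterns are assumptions about typical Linux kernel log wording that may need adjusting for a given kernel.

```python
import re

# Assumed wording of typical kernel log messages; adjust for your kernel.
OOM_PATTERN = re.compile(r"out of memory|oom-killer", re.IGNORECASE)
SEGFAULT_PATTERN = re.compile(r"segfault", re.IGNORECASE)


def classify_kernel_log(lines):
    """Group kernel log lines into OOM-related and segfault-related lists.

    `lines` is any iterable of strings, for example the captured
    output of `dmesg` split into lines. Lines matching neither
    pattern are ignored.
    """
    findings = {"oom": [], "segfault": []}
    for line in lines:
        if OOM_PATTERN.search(line):
            findings["oom"].append(line)
        elif SEGFAULT_PATTERN.search(line):
            findings["segfault"].append(line)
    return findings
```

Feeding it the lines of a saved dmesg dump gives a quick answer to whether a crash looks like the usual memory exhaustion or something else, such as the segfaults seen here.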


As a (hopefully) final follow-up: there were no clogged fans or tubes, although the notification endpoint that our uptime probe was sending to was not the one our status page was listening on, which is why the top of the status page said the Build Farm component was Operational while our uptime had been flatlining for hours. That’s been fixed. I also renamed “Jenkins Uptime” below to “Build Farm Uptime” so it is clearer that the uptime and the component are the same thing.
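
The endpoint mismatch described above is the kind of thing a small configuration check can catch. The following is only a sketch under assumptions: the function name `find_unrouted_probes`, the probe names, and the `example.invalid` URLs are all hypothetical and do not reflect how the actual probes or status page are configured.

```python
def find_unrouted_probes(probe_endpoints, listener_endpoints):
    """Return probes whose notification URL nothing is listening on.

    probe_endpoints: dict mapping a probe name to the URL it notifies.
    listener_endpoints: iterable of URLs the status page listens on.
    Both structures are hypothetical stand-ins for real configuration.
    """
    listeners = set(listener_endpoints)
    return {
        name: url
        for name, url in probe_endpoints.items()
        if url not in listeners
    }


# Hypothetical configuration: the probe notifies one URL,
# while the status page listens on a different one.
probes = {"build_farm_uptime": "https://hooks.example.invalid/old"}
listeners = ["https://hooks.example.invalid/new"]
unrouted = find_unrouted_probes(probes, listeners)
```

A non-empty result means at least one probe can fail without the status page ever changing state, which matches the symptom of a component showing Operational while its uptime graph flatlines.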

Thanks for the update Steve.

I’ve seen some nice segfaults with docker being updated and running containers; those are nice to diagnose and triage.

I’m guessing this was all partly also caused by the recent security issues that affected build.ros.org.

@nuclearsandwich: do you already have an ETA for when the redeployed buildfarm will come online and be available?

It seems pull request testing via ros-pull-request-builder is still unavailable - or are there any steps that need to be taken by package maintainers to re-enable it?

Thanks for reporting. There’s some configuration missing to get the pull request builder’s credentials updated. I’ve put it in the queue of follow-up work and will hopefully have it updated by the end of the week. I’ll post back here when it should be working again.

It took a couple of tries to get the updated credential to stick and I forgot to close the loop here. Pull Request jobs should be back online. Please let me know if a repository of yours is not getting PR jobs queued.

Thanks to @DLu for reminding me to post here.

@nuclearsandwich Awesome, thanks for fixing it!