@Martin_Guenther What you are describing as a goal is exactly what we are trying to achieve for the ROS buildfarm. But as you have noticed, it sometimes does send false-positive notifications, which is unfortunate. To comment on the examples you mentioned:
10 emails on 7 separate days caused by KeyError: “The cache has no package named ‘apt-src’” (I’m still getting those)
This is a known issue, also mentioned above in this thread (https://github.com/ros-infrastructure/ros_buildfarm/issues/369). As you might imagine, such problems can be extremely challenging to troubleshoot, so it takes a lot of time and effort. Hopefully the proposed workaround can be deployed soon and will address the problem.
1 email caused by "Pulling repository docker.io/osrf/ubuntu_armhf \n Could not reach any registry endpoint"
You might have read about the AWS outage last week, which led to various resources being unreachable. This is one symptom of that event.
1 email caused by E: Package ‘curl’ has no installation candidate
I haven’t seen this myself so I am not sure what was causing it. It would probably be good to investigate if it happens again but we need more information about the problem (e.g. which build, which platform, etc.).
Regarding your suggestions to improve the situation: they all sound reasonable. But for many of them it is unclear how to actually implement them. E.g. “Distinguish between ‘Failed’ (your fault) and ‘Errored’ (the buildfarm’s fault) states, like Travis does” is a great proposal. It is already being tracked as a ticket (https://github.com/ros-infrastructure/ros_buildfarm/issues/119). But so far nobody has had the ideas and time to figure out how to actually do it. If you have any specific ideas or suggestions, or could even provide a patch that achieves this, we would be more than happy to hear about it and work on it. Even something like “split the install into ‘setup’ and ‘build’ sections” is not as trivial as it sounds, since many of these steps are quite intermixed.
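To illustrate what such a distinction could look like in its most naive form, here is a minimal Python sketch that labels a build as "errored" when its console log contains one of the three messages quoted in this thread. The function name `classify_build` and the pattern list are hypothetical; this is not how ros_buildfarm works today, and a fixed pattern list would likely miss many real infrastructure problems.

```python
import re

# Hypothetical patterns, taken from the error messages quoted in this thread.
# Quote characters are left out on purpose, since logs may use different styles.
INFRASTRUCTURE_PATTERNS = [
    re.compile(r"The cache has no package named"),
    re.compile(r"Could not reach any registry endpoint"),
    re.compile(r"has no installation candidate"),
]


def classify_build(succeeded: bool, console_log: str) -> str:
    """Return 'passed', 'errored' (likely the buildfarm's fault) or 'failed'."""
    if succeeded:
        return "passed"
    # Any known infrastructure message in the log marks the build as errored.
    if any(p.search(console_log) for p in INFRASTRUCTURE_PATTERNS):
        return "errored"
    return "failed"
```

The hard part, as mentioned above, is that setup and build steps are mixed, so a message in the log does not always tell you whose fault the failure was.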
Provide a dashboard website, where you can see the status of all your own jobs at a glance.
Include the number of unsuccessful builds / days since last successful build in the email subject.
We are using standard Jenkins plugins to send notification emails. If you have a suggestion for how to configure them differently (or for a different plugin to use) to achieve this, we would be happy to incorporate it.
Perhaps one approach would be to try to identify “Errored” states heuristically; something like “the repo didn’t have any commits since the last successful build”, or “this is the first failed build in a row”. Then retry the job a couple of times before sending out the email alert.
The buildfarm is already applying several heuristics like this, just a few examples:
If you have any specific cases where the buildfarm could apply heuristics / retries to behave more gracefully we would be happy to hear about them.
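To make the quoted suggestion concrete, here is a minimal Python sketch of the "retry before alerting" idea. It treats a failure as suspicious when there were no new commits since the last successful build, or when it is the first failure in a row, and retries a couple of times before notifying. All names (`BuildContext`, `decide`, `max_retries`) are hypothetical, and this is not a description of the heuristics the buildfarm already applies.

```python
from dataclasses import dataclass


@dataclass
class BuildContext:
    # True if the repository received commits since the last successful build.
    new_commits_since_last_success: bool
    # True if the build right before this one succeeded (first failure in a row).
    previous_build_succeeded: bool


def looks_like_infrastructure_error(ctx: BuildContext) -> bool:
    """Heuristic from the quoted proposal; an assumption, not a proven rule."""
    return (not ctx.new_commits_since_last_success) or ctx.previous_build_succeeded


def decide(ctx: BuildContext, retries_done: int, max_retries: int = 2) -> str:
    """For a build that just failed, return 'retry' or 'notify'."""
    if looks_like_infrastructure_error(ctx) and retries_done < max_retries:
        return "retry"
    return "notify"
```

With these assumptions, a failure without new commits is retried up to `max_retries` times, while a repeated failure after new commits triggers the email immediately.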