Over the last few days many reports have come in that rqt_topic doesn’t work anymore (e.g. https://github.com/ros-visualization/rqt_common_plugins/issues/432). I suspect it is related to a Python update removing an undocumented attribute. I released version 0.4.6 on Monday to address the problem. Since rqt_topic is a crucial tool for many ROS users, I think this change should be synced to the public repos asap.
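For context, the failure mode is roughly the following; the class and attribute names below are made up for illustration and this is not the actual rqt_topic code:

```python
# Illustration only: relying on an undocumented attribute that a Python
# patch release is free to remove. Names here are hypothetical.

class UpstreamType:
    """Stand-in for an upstream type whose internals changed in the update."""
    # On the "old" interpreter instances happened to expose `_internal_cache`;
    # after the update that attribute is simply gone.
    pass

obj = UpstreamType()

try:
    cache = obj._internal_cache  # raises AttributeError on the new interpreter
except AttributeError:
    # A defensive lookup keeps the tool working on both old and new interpreters.
    cache = getattr(obj, "_internal_cache", {})

print(cache)  # -> {} where the attribute no longer exists
```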
Since Indigo and Jade have been synced just recently and have no regressions, they should be synced as soon as the buildfarm has finished the current jobs. I have already put incoming rosdistro PRs on hold.
In Kinetic there are several regressions at the moment. Those need to be resolved to get Kinetic into a mergeable state. @tfoote will follow up with a list of regressions later.
Link error:
11:18:16 -- catkin 0.7.6
11:18:17 -- Using these message generators: gencpp;geneus;genlisp;gennodejs;genpy
11:18:17 CMake Error at /opt/ros/kinetic/share/gazebo_plugins/cmake/gazebo_pluginsConfig.cmake:141 (message):
11:18:17 Project 'velodyne_gazebo_plugins' tried to find library '-lpthread'. The
11:18:17 library is neither a target nor built/installed properly. Did you compile
11:18:17 project 'gazebo_plugins'? Did you find_package() it before the subdirectory
11:18:17 containing its code is included?
11:18:17 Call Stack (most recent call first):
11:18:17 /opt/ros/kinetic/share/catkin/cmake/catkinConfig.cmake:76 (find_package)
11:18:17 CMakeLists.txt:4 (find_package)
11:18:17
11:18:17
11:18:17 -- Configuring incomplete, errors occurred!
11:18:17 See also "/tmp/binarydeb/ros-kinetic-velodyne-gazebo-plugins-1.0.3/obj-x86_64-linux-gnu/CMakeFiles/CMakeOutput.log".
11:18:17 See also "/tmp/binarydeb/ros-kinetic-velodyne-gazebo-plugins-1.0.3/obj-x86_64-linux-gnu/CMakeFiles/CMakeError.log".
11:18:17 dh_auto_configure: cmake .. -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_BUILD_TYPE=None -DCATKIN_BUILD_BINARY_PACKAGE=1 -DCMAKE_INSTALL_PREFIX=/opt/ros/kinetic -DCMAKE_PREFIX_PATH=/opt/ros/kinetic returned exit code 1
11:18:17 debian/rules:26: recipe for target 'override_dh_auto_configure' failed
11:18:17 make[1]: Leaving directory '/tmp/binarydeb/ros-kinetic-velodyne-gazebo-plugins-1.0.3'
@jrivero Thanks, it looks like that’s all rebuilding cleanly.
Looking at the build status, there are a few packages that weren’t caught in the previous triage. We have one package regressing, jsk_pcl_ros, which is blocking two downstream packages (current state).
The buildfarm has finished the rebuilds on all target platforms, but there are more broken packages at the moment. openhrp3 has been broken for almost two weeks on Xenial armhf as well as armv8:
@130s @k-okada @tfoote Can you please confirm that you have received the 18 notification emails the buildfarm has sent you for the failing armhf builds?
I had to retrigger some jobs for Debian Jessie armv8 again; they failed due to https://github.com/ros-infrastructure/ros_buildfarm/issues/369. If the Jessie builds finish by Monday, I would suggest moving forward with the sync, since in my opinion the benefit of deploying the critical patch to an important package like rqt_topic outweighs the potential removal of a single package on arm which has been failing for a while already.
While 18 armhf jobs failed, only 14 sent actual emails. If a job fails for “some internal” reasons, the buildfarm doesn’t send notification emails. Therefore 14 emails is actually the expected count.
But regarding the armhf builds: I am interested in why we got into the situation we are in now (we want to sync quickly to get urgent fixes out to ROS users but can’t due to regressions). It seems that the notification system, which aims to let the maintainers as well as the ROS distro maintainer know about the problem, worked as expected. It would be great if you could describe your point of view. Is there anything that can be done differently in order to avoid similar problems in the future? Either for the infrastructure, the process, or anything else.
All rebuilds have finished. The only regression is openhrp3 on the three arm platforms. Therefore Kinetic is ready to be synced at @tfoote’s discretion.
Since the buildfarm has caught up and we only had the one package regression on a subset of architectures, I’ve triggered the sync to get it out to everyone.
I can’t speak for @130s, but something similar has happened to me before: I overlooked a valid buildfarm email because it was lost in the noise. Here’s a breakdown of the Jenkins “Build Failed” emails I got over the last 23 days:
10 emails on 7 separate days caused by KeyError: "The cache has no package named 'apt-src'" (I’m still getting those)
1 email caused by "Pulling repository docker.io/osrf/ubuntu_armhf \n Could not reach any registry endpoint"
1 email caused by E: Package 'curl' has no installation candidate
0 emails caused by something I did
Also, it’s really hard to spot the cause of the error. The emails are just a wall of text without highlighting, so even scrolling to the bottom takes a while, especially on a phone (which is where I do the initial screening of my email most of the time). When you’ve conscientiously spent a full minute trying to figure out whether it’s a real build failure, only to find it’s not in 90% of cases, it just conditions people to simply ignore the emails (that’s at least what happens to me). Especially because the only thing you can do to fix it is wait until the buildfarm sorts itself out.
So, what could be done?
Distinguish between “Failed” (your fault) and “Errored” (the buildfarm’s fault) states, like Travis does. The easiest way to get this right 100% of the time is probably to split the install into “setup” and “build” sections: if anything fails during setup, it’s “Errored”; if it fails during build, it’s “Failed”.
Provide a dashboard website, where you can see the status of all your own jobs at a glance.
Include the number of unsuccessful builds / days since the last successful build in the email subject. If something has been failing continuously for two weeks, it might not be a buildfarm fluke but something I can actually fix. With the current stream of emails alternating between “Build Failed” and “Build is back to normal”, it’s hard to see when one repo is consistently failing.
Try to get the number of false positives as close to 0 as possible! Travis has never sent me a bogus “Failed” email. Perhaps one approach would be to identify “Errored” states heuristically; something like “the repo didn’t have any commits since the last successful build” or “this is the first failed build in a row”. Then retry the job a couple of times before sending out the email alert. (A rough sketch of this idea follows below.)
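To make that last idea more concrete, here is a minimal sketch of the kind of notification gating I have in mind. All names, fields, and thresholds are made up for illustration; this is not part of ros_buildfarm or of any Jenkins plugin.

```python
# Rough sketch of the gating heuristic described above; everything here
# (names, thresholds, the BuildRecord fields) is hypothetical.
from dataclasses import dataclass


@dataclass
class BuildRecord:
    failed: bool                      # result of the latest run
    failed_during_setup: bool         # apt/docker/network problems, i.e. "Errored"
    consecutive_failures: int         # including the latest run
    commits_since_last_success: int


def should_notify(record: BuildRecord, retries_done: int, max_retries: int = 2) -> bool:
    """Decide whether a failure warrants an email to the maintainer."""
    if not record.failed:
        return False
    # Infrastructure problems ("Errored") should never page the maintainer.
    if record.failed_during_setup:
        return False
    # Retry a couple of times before concluding the failure is real.
    if retries_done < max_retries:
        return False
    # A first-time failure without any new commits smells like a fluke.
    if record.consecutive_failures == 1 and record.commits_since_last_success == 0:
        return False
    return True


# Example: second consecutive failure after new commits, retries exhausted -> notify.
print(should_notify(BuildRecord(True, False, 2, 3), retries_done=2))  # True
```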
@Martin_Guenther What you are describing as a goal is exactly what we are trying to achieve for the ROS buildfarm. But as you have noticed, it does sometimes send false positive notifications, which is unfortunate. To comment on the examples you mentioned:
10 emails on 7 separate days caused by KeyError: “The cache has no package named ‘apt-src’” (I’m still getting those)
You might have read about the AWS outage last week, which led to various resources being unreachable. This is one symptom of that event.
1 email caused by E: Package ‘curl’ has no installation candidate
I haven’t seen this myself, so I am not sure what was causing it. It would probably be good to investigate if it happens again, but we would need more information about the problem (e.g. which build, which platform, etc.).
Regarding your suggestions to improve the situation: they all sound reasonable, but for many of them it is unclear how to actually implement them. E.g. “Distinguish between ‘Failed’ (your fault) and ‘Errored’ (the buildfarm’s fault) states, like Travis does” is a great proposal. It is already being tracked as a ticket (update jobs to not report failures if it fails before setup completes · Issue #119 · ros-infrastructure/ros_buildfarm · GitHub), but until now nobody has had the ideas and time to figure out how to actually do it. If you have any specific ideas and suggestions, or could even provide a patch for how to achieve this, we would be more than happy to hear about it and work on it. Even something like “split the install into ‘setup’ and ‘build’ sections” is not as trivial as it sounds, since many of these steps are quite mixed.
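To illustrate the shape of such a split (while acknowledging that untangling the existing, intertwined steps is the actual hard part), here is a minimal sketch assuming a hypothetical job wrapper. The commands are placeholders and this is not how the ros_buildfarm jobs are actually structured.

```python
# Hypothetical job wrapper: map setup failures to "Errored" (no maintainer
# notification) and build failures to "Failed" (notify). Commands are placeholders.
import subprocess
import sys

SETUP_STEPS = [
    ["echo", "update apt cache and install build dependencies"],
    ["echo", "pull docker base image"],
]
BUILD_STEPS = [
    ["echo", "configure, build and test the package"],
]


def run_phase(steps):
    """Run each command in order; stop and report failure on the first non-zero exit."""
    for cmd in steps:
        if subprocess.call(cmd) != 0:
            return False
    return True


if not run_phase(SETUP_STEPS):
    # Anything failing here is an infrastructure problem -> "Errored".
    print("ERRORED")
    sys.exit(2)
if not run_phase(BUILD_STEPS):
    # Failures here are attributable to the package -> "Failed".
    print("FAILED")
    sys.exit(1)
print("SUCCESS")
```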
Provide a dashboard website, where you can see the status of all your own jobs at a glance.
Include the number of unsuccessful builds / days since last successful build in the email subject.
We are using standard Jenkins plugins to send the notification emails. If you have a suggestion for configuring them differently (or for using a different plugin) to achieve this, we would be happy to incorporate it.
Perhaps one approach would be to try to identify “Errored” states heuristically; something like “the repo didn’t have any commits since the last successful build”, or “this is the first failed build in a row”. Then retry the job a couple of times before sending out the email alert.
The buildfarm already applies several heuristics like this; to name just a few examples:
@dirk-thomas I hope my post didn’t come across as complaining about the buildfarm too much. The intent was to answer your question about why regressions slip past maintainers’ notice even though “the notification system worked as expected”. In my experience, one factor is the poor signal-to-noise ratio. But perhaps I shouldn’t have hijacked your conversation with @130s.
I appreciate all the work you’ve put into the buildfarm, and I’m not complaining, given that I don’t have time to work on implementing my suggestions myself.
If I understand this correctly, before a build results in “Failure” (and triggers a notification email), it is rescheduled and retried twice at some later time? Strange, I would have expected that to catch almost all the intermittent errors. But I guess it’s a question of probabilities: since the buildfarm isn’t just triggered whenever I commit something but whenever a dependency changes, and since the same build is run on many platforms, even a small probability of any single build erroring results in a substantial rate of false alarms.
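As a back-of-envelope illustration of that scale argument (the numbers are completely made up; only the shape of the calculation matters):

```python
# Probability that at least one of many independent build runs hits an
# intermittent infrastructure error. All numbers are illustrative guesses.
p_error = 0.005   # assumed chance that a single build run errors for spurious reasons
n_builds = 300    # builds touched by dependency changes across platforms in some window

p_at_least_one_false_alarm = 1 - (1 - p_error) ** n_builds
print(f"{p_at_least_one_false_alarm:.1%}")  # ~77.8% with these made-up numbers
```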
@Martin_Guenther I am sorry if I gave that impression. I highly appreciate the ideas and the will to make the infrastructure better. I just noticed that the existing ticket https://github.com/ros-infrastructure/ros_buildfarm/issues/119 was actually created based on your feedback on the mailing list over a year ago.
The rate at which even common operations fail, considering the scale of the ROS buildfarm, is certainly a problem. We try to work around those failures in many places, but there is certainly room for improvement to get closer to zero false positives.
I think the main “problem” at the moment is that we don’t have a clear idea how to “catch” any of the remaining false positives. Looking at some of them, it is relatively easy for a human to decide whether it is a real error or a false positive. Sometimes the same failure can be a false positive, sometimes it is not. If we can come up with a proposal for how to distinguish even only some of the false positives reliably, I would be more than happy to work on a patch.
I’m sorry that I didn’t notice I was asked for an opinion. I don’t have much to add regarding improvements, given that the buildfarm maintainers are already doing what they can, and I’m also aware of a discussion from last year. With all of this in mind, it simply doesn’t seem easy.
That said, I personally react to notifications from the buildfarm only if they repeat twice or more (e.g. I mostly ignore the first batch of notifications for a package and only look into it once I receive a second batch). This works for me because in some (or many) cases the notification happens only once and then stops (I’m not sure whether this is the same as what you call a false positive). I know I shouldn’t ignore any notifications, but knowing that false positives can happen, this approach has been working at least on my side.
(As for the particular case of openhrp3 discussed above, I’m no longer maintaining it (I just unassigned myself), so I can’t speak to any of its issues.)