Need to sync new release of rqt_topic (Indigo, Jade, Kinetic)

In the last few days many reports have come in that rqt_topic doesn’t work anymore (e.g. https://github.com/ros-visualization/rqt_common_plugins/issues/432). I suspect it is related to a Python update (removing an undocumented attribute). I released version 0.4.6 on Monday to address the problem. Since rqt_topic is a crucial tool for many ROS users, I think this change should be synced to the public repos ASAP.
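
The exact attribute isn’t named here; as a purely generic sketch (hypothetical names, not the actual rqt_topic 0.4.6 patch), the usual fix for this kind of breakage is to stop relying on the undocumented attribute and fall back gracefully when it is absent:

    # Hypothetical sketch, not the actual rqt_topic 0.4.6 change: guard access
    # to an attribute that a newer Python release no longer provides instead of
    # letting an AttributeError break the whole plugin.
    def get_cached_value(obj):
        # '_cached_value' stands in for the removed, undocumented attribute.
        value = getattr(obj, '_cached_value', None)
        if value is not None:
            return value
        # Fall back to a documented way of computing the same information
        # (compute_value() is a hypothetical public method of 'obj').
        return obj.compute_value()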

Since Indigo and Jade have just recently been synced and have no regressions, they should be synced as soon as the buildfarm has finished the current jobs. I have already put incoming rosdistro PRs on hold.

In Kinetic there are several regressions at the moment. Those need to be resolved to get Kinetic into a mergeable state. @tfoote will follow up with a list of regressions later.

Kinetic

Looking through the state of Kinetic I’ve identified 4 regressions, which appear to have 3 root causes.

  1. On Xenial 32-bit we’ve hit the docker failure “[8] System error: read parent: connection reset by peer” (Issue #337 · ros-infrastructure/ros_buildfarm · GitHub) again. There was a workaround in “add magic docker fix env var” by dirk-thomas (Pull Request #377 · ros-infrastructure/ros_buildfarm · GitHub). @dirk-thomas

  2. There are two gazebo plugins that seem to be failing after this release of gazebo_ros_pkgs (gazebo_ros_pkgs: 2.5.9-0 in 'kinetic/distribution.yaml' [bloom] by j-rivero · Pull Request #13934 · ros/rosdistro · GitHub) @jrivero:

hector_gazebo_thermal_camera
http://build.ros.org/view/Kbin_uW64/job/Kbin_uW64__hector_gazebo_thermal_camera__ubuntu_wily_amd64__binary/32/console

Missing dependency

23:35:33 -- catkin 0.7.6
23:35:33 CMake Error at /opt/ros/kinetic/share/gazebo_plugins/cmake/gazebo_pluginsConfig.cmake:141 (message):
23:35:33   Project 'hector_gazebo_thermal_camera' tried to find library '-lpthread'.
23:35:33   The library is neither a target nor built/installed properly.  Did you
23:35:33   compile project 'gazebo_plugins'? Did you find_package() it before the
23:35:33   subdirectory containing its code is included?
23:35:33 Call Stack (most recent call first):
23:35:33   /opt/ros/kinetic/share/catkin/cmake/catkinConfig.cmake:76 (find_package)
23:35:33   CMakeLists.txt:7 (find_package)
23:35:33 
23:35:33 -- Configuring incomplete, errors occurred!
23:35:33 See also "/tmp/binarydeb/ros-kinetic-hector-gazebo-thermal-camera-0.5.0/obj-x86_64-linux-gnu/CMakeFiles/CMakeOutput.log".
23:35:33 See also "/tmp/binarydeb/ros-kinetic-hector-gazebo-thermal-camera-0.5.0/obj-x86_64-linux-gnu/CMakeFiles/CMakeError.log".
23:35:33 
23:35:33 dh_auto_configure: cmake .. -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_BUILD_TYPE=None -DCATKIN_BUILD_BINARY_PACKAGE=1 -DCMAKE_INSTALL_PREFIX=/opt/ros/kinetic -DCMAKE_PREFIX_PATH=/opt/ros/kinetic returned exit code 1
23:35:33 make[1]: *** [override_dh_auto_configure] Error 2
23:35:33 make: *** [build] Error 2
23:35:33 dpkg-buildpackage: error: debian/rules build gave error exit status 2
23:35:33 E: Building failed
23:35:33 Traceback (most recent call last):
23:35:33   File "/tmp/ros_buildfarm/ros_buildfarm/binarydeb_job.py", line 133, in build_binarydeb
23:35:33     subprocess.check_call(cmd, cwd=source_dir)
23:35:33   File "/usr/lib/python3.4/subprocess.py", line 561, in check_call
23:35:33     raise CalledProcessError(retcode, cmd)
23:35:33 subprocess.CalledProcessError: Command '['apt-src', 'build', 'ros-kinetic-hector-gazebo-thermal-camera']' returned non-zero exit status 1

velodyne_gazebo_plugins:

http://build.ros.org/view/Kbin_uW64/job/Kbin_uW64__velodyne_gazebo_plugins__ubuntu_wily_amd64__binary/20/console

Link error:

11:18:16 -- catkin 0.7.6
11:18:17 -- Using these message generators: gencpp;geneus;genlisp;gennodejs;genpy
11:18:17 CMake Error at /opt/ros/kinetic/share/gazebo_plugins/cmake/gazebo_pluginsConfig.cmake:141 (message):
11:18:17   Project 'velodyne_gazebo_plugins' tried to find library '-lpthread'.  The
11:18:17   library is neither a target nor built/installed properly.  Did you compile
11:18:17   project 'gazebo_plugins'? Did you find_package() it before the subdirectory
11:18:17   containing its code is included?
11:18:17 Call Stack (most recent call first):
11:18:17   /opt/ros/kinetic/share/catkin/cmake/catkinConfig.cmake:76 (find_package)
11:18:17   CMakeLists.txt:4 (find_package)
11:18:17 
11:18:17 
11:18:17 -- Configuring incomplete, errors occurred!
11:18:17 See also "/tmp/binarydeb/ros-kinetic-velodyne-gazebo-plugins-1.0.3/obj-x86_64-linux-gnu/CMakeFiles/CMakeOutput.log".
11:18:17 See also "/tmp/binarydeb/ros-kinetic-velodyne-gazebo-plugins-1.0.3/obj-x86_64-linux-gnu/CMakeFiles/CMakeError.log".
11:18:17 dh_auto_configure: cmake .. -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_BUILD_TYPE=None -DCATKIN_BUILD_BINARY_PACKAGE=1 -DCMAKE_INSTALL_PREFIX=/opt/ros/kinetic -DCMAKE_PREFIX_PATH=/opt/ros/kinetic returned exit code 1
11:18:17 debian/rules:26: recipe for target 'override_dh_auto_configure' failed
11:18:17 make[1]: Leaving directory '/tmp/binarydeb/ros-kinetic-velodyne-gazebo-plugins-1.0.3'
  3. The kobuki_ftdi release appears to have a regression in the install rules: kobuki_core: 0.7.4-0 in 'kinetic/distribution.yaml' [bloom] by stonier · Pull Request #14003 · ros/rosdistro · GitHub @Daniel_Stonier

Jade

Looking at Jade, it appears to still be in a good state; it is just rebuilding a few armhf packages.

Indigo

Indigo had a few jobs that needed retriggering, but it’s also still rebuilding on all platforms.

//cc @davetcoleman

Thanks Tully. For reference, the origin of the problem is this PR in the gazebo_ros_pkgs repo. We are discussing how to handle the issue in upstream catkin.

If the maintainers cannot patch the packages during the next week, I can try to get some time to work on patching both; just ping me.

Update:

Indigo

There’s one package still building, but otherwise it looks good.

Jade

It looks good. We’ll review and probably sync both Jade and Indigo in the morning.

Kinetic

@Daniel_Stonier Thanks for the quick fix.
@jrivero Can we roll the release back while we discuss a more complete solution so we can get a release out?
@dirk-thomas I’ve created a PR (https://github.com/ros-infrastructure/ros_buildfarm/pull/388) to attempt to avoid the docker json issue.

Yes, I’ve reverted two commits that I think could have caused the regression. Both packages compile in my local workspace. gazebo_ros_pkgs: 2.5.10-0 in 'kinetic/distribution.yaml' [bloom] by j-rivero · Pull Request #14084 · ros/rosdistro · GitHub

@jrivero Thanks, it looks like that’s all rebuilding cleanly.

Looking at the build status, there are a few packages that weren’t caught in the previous triage. We have one package regressing, jsk_pcl_ros, which is blocking two downstream packages (current state).

@k-okada If you could take a look, that would be great. We’d like to get a fix out on Monday.
I created an issue here: https://github.com/jsk-ros-pkg/jsk_recognition/issues/2022

The buildfarm has finished the rebuilds on all target platforms, but there are more broken packages at the moment. openhrp3 has been broken for almost two weeks on Xenial armhf as well as armv8.

@130s @k-okada @tfoote Can you please confirm that you have received the 18 notification emails the buildfarm has sent you for the failing armhf builds?

I had to retrigger some jobs for Debian Jessie armv8 again, which failed due to https://github.com/ros-infrastructure/ros_buildfarm/issues/369. If the Jessie builds finish by Monday, I would suggest moving forward with the sync, since in my opinion the benefit of deploying the critical patch to an important package like rqt_topic outweighs the potential removal of a single package on arm which has failed for a while already.

In my inbox,

  • "Kbin_uxhf_uXhf__openhrp3__ubuntu_xenial_armhf__binary" returns 14 emails (excluding this thread).
  • "Kbin_uxv8_uXv8__openhrp3__ubuntu_xenial_arm64__binary" returns 0.

The maintainer notification for arm64 is not enabled (see https://github.com/ros-infrastructure/ros_buildfarm_config/blob/5326af879d6b88bbd793eade7c5ca4aa065c9fe0/kinetic/release-xenial-arm64-build.yaml#L13). I am not sure why that is the case. This is at the discretion of the person managing a specific ROS distro. So it is expected that you didn’t receive any notification emails for that platform.

While 18 armhf jobs failed, only 14 sent actual emails. If a job fails for “some internal” reasons the buildfarm doesn’t send notification emails. Therefore 14 emails is actually the expected count.

But regarding the armhf builds: I am interested in why we got into the situation we are in now (we want to sync quickly to get urgent fixes out to ROS users but can’t due to regressions). It seems that the notification system, which aims to let the maintainers as well as the ROS distro maintainer know about the problem, worked as expected. It would be great if you could describe your point of view. Is there anything that could be done differently in order to avoid similar problems in the future? Either for the infrastructure, the process, or anything else.

All rebuilds have finished. The only regression is openhrp3 on the three arm platforms. Therefore Kinetic is ready to be synced at @tfoote’s discretion.

Since the buildfarm has caught up and we only had the one package regression on a subset of architectures, I’ve triggered the sync to get it out to everyone.

@k-okada @130s I’ve ticketed the issue here: https://github.com/fkanehiro/openhrp3/issues/123 We can do another quick sync if there’s an updated version of openhrp3 with a fix soon.

@dirk-thomas Since arm64 was added experimentally in Kinetic, I haven’t turned on notifications and generally don’t consider it gating for a sync.

I can’t speak for @130s, but something similar has happened to me before: I overlooked a valid buildfarm email because it was lost in the noise. Here’s a breakdown of the Jenkins “Build Failed” emails I got over the last 23 days:

  • 10 emails on 7 separate days caused by KeyError: "The cache has no package named 'apt-src'" (I’m still getting those)
  • 1 email caused by "Pulling repository docker.io/osrf/ubuntu_armhf \n Could not reach any registry endpoint"
  • 1 email caused by E: Package 'curl' has no installation candidate
  • 0 emails caused by something I did

Also, it’s really hard to spot the cause of the error. The emails are just a wall of text without highlighting, so even scrolling to the bottom takes a while, especially on a phone (which is where I do the initial screening of my email most of the time). When you’ve conscientiously spent a full minute trying to figure out whether it’s a real build failure, only to find it’s not in 90% of cases, it just conditions people to simply ignore the emails (that’s at least what happens to me), especially because the only thing you can do to fix it is wait until the buildfarm sorts itself out.

So, what could be done?

  • Distinguish between “Failed” (your fault) and “Errored” (the buildfarm’s fault) states, like Travis does. The easiest way to get this right 100% of the time is probably to split the install into “setup” and “build” sections. If anything fails during setup, it’s “Errored”; if it fails during build, it’s “Failed” (see the sketch after this list).
  • Provide a dashboard website, where you can see the status of all your own jobs at a glance.
  • Include the number of unsuccessful builds / days since last successful build in the email subject. If something is continuously failing for two weeks, it might not be a buildfarm fluke, but something I can fix. With the current stream of emails alternating between “Build Failed” and “Build is back to normal”, it’s hard to see when there’s one repo that’s consistently failing.
  • Try to get the number of false positives as close to 0 as possible! Travis has never sent me a bogus “Failed” email. Perhaps one approach would be to try to identify “Errored” states heuristically; something like “the repo didn’t have any commits since the last successful build”, or “this is the first failed build in a row”. Then retry the job a couple of times before sending out the email alert.
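
To make the first point concrete, here is a minimal sketch (hypothetical code, not part of ros_buildfarm; the commands and package name are placeholders) of recording which phase a job was in when a command failed:

    # Hypothetical sketch of the "Failed" vs. "Errored" split -- not the actual
    # ros_buildfarm implementation. The commands are placeholders.
    import subprocess

    SETUP_STEPS = [
        ['apt-get', 'update'],                   # environment / infrastructure setup
        ['apt-src', 'install', 'some-package'],  # fetching the sources to build
    ]
    BUILD_STEPS = [
        ['apt-src', 'build', 'some-package'],    # the maintainer's code being built
    ]

    def run_job():
        for cmd in SETUP_STEPS:
            if subprocess.call(cmd) != 0:
                return 'errored'  # buildfarm's fault: no maintainer notification
        for cmd in BUILD_STEPS:
            if subprocess.call(cmd) != 0:
                return 'failed'   # maintainer-facing failure: send the email
        return 'passed'

Whether the buildfarm’s actual job steps can be partitioned that cleanly is, of course, a separate question.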

@Martin_Guenther What you are describing as a goal is exactly what we are trying to achieve for the ROS buildfarm. But as you have noticed, it does sometimes send false-positive notifications, which is unfortunate. To comment on the examples you mentioned:

10 emails on 7 separate days caused by KeyError: “The cache has no package named ‘apt-src’” (I’m still getting those)

This is a known issue also mentioned above in this thread (debian builds are failing on KeyError: "The cache has no package named 'apt-src'" · Issue #369 · ros-infrastructure/ros_buildfarm · GitHub). As you might imagine, such problems can be extremely challenging to troubleshoot, so it takes a lot of time and effort. Hopefully the proposed workaround can be deployed soon and will address the problem.

1 email caused by "Pulling repository docker.io/osrf/ubuntu_armhf \n Could not reach any registry endpoint"

You might have read about the AWS outage last week, which led to various resources being unreachable. This is one symptom of that event.

1 email caused by E: Package ‘curl’ has no installation candidate

I haven’t seen this myself, so I am not sure what was causing it. It would probably be good to investigate if it happens again, but we would need more information about the problem (e.g. which build, which platform, etc.).

Regarding your suggestions to improve the situation: they all sound reasonable, but for many of them it is unclear how to actually implement them. E.g. “Distinguish between ‘Failed’ (your fault) and ‘Errored’ (the buildfarm’s fault) states, like Travis does” is a great proposal. It is already being tracked as a ticket (update jobs to not report failures if it fails before setup completes · Issue #119 · ros-infrastructure/ros_buildfarm · GitHub), but until now nobody has had the ideas and time to figure out how to actually do it. If you have any specific ideas and suggestions, or could even provide a patch for how to achieve this, we would be more than happy to hear about it and work on it. Even something like “split the install into ‘setup’ and ‘build’ sections” is not as trivial as it sounds, since many of these steps are quite intertwined.

Provide a dashboard website, where you can see the status of all your own jobs at a glance.

Include the number of unsuccessful builds / days since last successful build in the email subject.

We are using standard Jenkins plugins to send the notification emails. If you have a suggestion for how to configure them differently (or for using a different plugin) to achieve this, we would be happy to incorporate it.

Perhaps one approach would be to try to identify “Errored” states heuristically; something like “the repo didn’t have any commits since the last successful build”, or “this is the first failed build in a row”. Then retry the job a couple of times before sending out the email alert.

The buildfarm already applies several heuristics like this; just a few examples:

If you have any specific cases where the buildfarm could apply heuristics / retries to behave more gracefully, we would be happy to hear about them.
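
For illustration, the general shape of such a retry heuristic could look like this (hypothetical code, not the actual ros_buildfarm implementation; the retry count and delay are made up):

    # Illustrative only -- not the actual ros_buildfarm code. Retry an operation
    # that is known to fail intermittently before treating it as a real error.
    import subprocess
    import time

    def run_with_retries(cmd, attempts=3, delay=30):
        for attempt in range(1, attempts + 1):
            if subprocess.call(cmd) == 0:
                return True
            if attempt < attempts:
                time.sleep(delay)  # let transient network / registry issues clear
        return False

    # Only report a failure (and notify maintainers) if every attempt failed.
    if not run_with_retries(['docker', 'pull', 'osrf/ubuntu_armhf']):
        raise RuntimeError('docker pull failed after repeated attempts')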

@dirk-thomas I hope my post didn’t come across as complaining about the buildfarm too much. The intent was to answer your question of why regressions slip through maintainers’ notice even though “the notification system worked as expected”. In my experience, one factor in why this happens is the poor signal-to-noise ratio. But perhaps I shouldn’t have hijacked your conversation with @130s.

I appreciate all the work you’ve put into the build farm, and I’m not complaining, given that I don’t have time to work on implementing my suggestions myself.

If I understand this correctly, before a build results in “Failure” (and triggers a notification email), it is rescheduled and retried twice at some later time? Strange. I would have expected that to catch almost all of the intermittent errors. But I guess it’s a question of probabilities: since the buildfarm isn’t just triggered whenever I commit something but whenever a dependency changes, and since the same build is run on many platforms, even a small probability of any single build erroring results in a substantial rate of false alarms.
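
To put rough numbers on that intuition (the error rate and job count below are made-up illustrative values, not measured buildfarm statistics): with a 0.5% chance of an intermittent error per job and a package rebuilt across 20 platform/architecture jobs whenever any of its dependencies changes, the chance that at least one job errors is already close to 10%.

    # Made-up illustrative numbers, not measured buildfarm statistics.
    p_error_per_job = 0.005  # 0.5% chance a single job hits an intermittent error
    jobs_per_trigger = 20    # e.g. several platforms x architectures per rebuild

    p_at_least_one = 1 - (1 - p_error_per_job) ** jobs_per_trigger
    print('%.1f%%' % (100 * p_at_least_one))  # ~9.5%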

@Martin_Guenther I am sorry if I gave that impression. I highly appreciate the ideas and the will to make the infrastructure better. I just noticed that the existing ticket https://github.com/ros-infrastructure/ros_buildfarm/issues/119 was actually created based on your feedback in the mailing list over a year ago :wink:

The rate at which even common operations fail, considering the scale of the ROS buildfarm, is certainly a problem. We try to work around those failures in many places. But there is certainly room for improvement to get closer to no false positives.

I think the main “problem” at the moment is that we don’t have a clear idea how to “catch” any of the remaining false positives. Looking at some of them, it is relatively easy for a human to decide whether it is a real error or a false positive. But sometimes the same failure can be a false positive and sometimes it is not :worried: If we can come up with a proposal for how to distinguish even only some of the false positives reliably, I would be more than happy to work on a patch.

I’m sorry that I didn’t notice I was asked for an opinion. I don’t have much to add regarding improvements, given that the buildfarm maintainers are already doing what they can. I’m also aware of a discussion from last year. With all of that in mind, it simply doesn’t seem easy.

That said, I personally react to notifications from the buildfarm only if they repeat twice or more (e.g. I mostly ignore the first batch of notifications for a package; I’ll look into it once I receive the second batch). This works for me because in some (or many) cases the notification happens only once and then stops (I’m not sure whether this is the same as what you call a false positive). I know I shouldn’t ignore any notifications, but knowing that false positives can happen, this approach has been working at least on my side.

(For the particular case of openhrp3 discussed above, I am no longer maintaining it (I just unassigned myself), so I can’t speak to any of its issues.)

No, but I just wanted to make sure. Sometimes nuances in tone get lost via text. :slight_smile: