Hey folks, Dashing user here. Something concerning happened a few weeks ago that is continuing to have ripple effects in our work. That “something concerning” is this:
I’ve read through the issue addressed by that PR and I have no argument with any of it. What I want to discuss is this: why was this introduced in a stable release? An LTS, no less? These marker files didn’t need to be specifically installed, and now all of a sudden they do. Our CI broke without us changing anything, which turned into a flag day trying to find what broke and fix it, holding up releases and development. Several other ROS 2 packages broke, which sadly burned another team member’s time tracking down the same issue.
This is continuing to effect us, weeks later. My understanding is that we have a policy of breaking changes being okay between releases (e.g. this would have been fine landing into Eloquent), but can someone please explain how this made it back to Dashing?
I’m not trying to justify it, but from a technical perspective, colcon is not part of the release. It is released for each operating system. So we cannot change this particular item in just Eloquent and not Dashing.
This level of regression was certainly unforeseen and unfortunate.
In general the pure Python packages (like catkin_pkg) are not ROS distro specific. On non-Debian platforms they are installed via pip and on Debian-based platforms there are python(3)-<name> packages available across all targeted Ubuntu/Debian distros. As such an update can’t be made available for only ROS Eloquent but not for ROS Dashing (since both target the very same Ubuntu distro).
If the level of issues would have been foreseen I would guess the change wouldn’t have been merged as is. What made this hard to find was that if you had an existing workspace these marker files already existed so just testing the patch didn’t show any problems. Also on CI for running the tests this wasn’t a problem (and where it was it got resolved by PRs).
Adding a warning (as now proposed in https://github.com/colcon/colcon-ros/pull/79) before removing the automagic creation of the marker files would have certainly been the better approach.
While the warning is appreciated, it only would have helped track down the issue once everything broke, not prevented the breakage in the first place.
Thank you for the explanation. While there is certainly a technical difference, utilities used within stable releases tend to be lumped into “the stable release” by us users, especially ones so pivotal as the build system. Breaking those tools feels the same as breaking the release. This seems like an unfortunate design decision, but it’s one we need to own, not hide behind. I’m assuming we can’t/don’t want to change it, but it backs us into a bit of a corner. As far as I can see, there are two possible paths afforded by such a decision:
Developing colcon is hard because you can essentially never break it as all stable releases are using the same version.
Using colcon is hard because all stable releases are using the same version which means non-backward-compatible changes hit all users, even those of stable releases.
To be completely frank, in my opinion (1) would be the responsible ownership of this design. That does not seem to be the path we’re on. If you insist on (2), can we at least come up with a strict scheduled deprecate-then-remove policy?
That is the very same for each and every Python package we ship. And yes, while it is making things difficult we have managed pretty ok for the past ten+ years with that approach.
Also I don’t think the alternative (releasing different version into each ROS distro) would make things easier. E.g. you wouldn’t be able to call any command until you have sourced a distro (chick-egg-problem) which isn’t really feasible for a build tool which you need to bootstrap your very first build.
The main issue here is awareness. When we do know changes are breaking stuff we are very carefully considering how to roll them out and/or modify the change to be compatible (e.g. REPs for evolving the rosdistro database).
If preferred by the community I am absolutely fine with reverting the change for now and only go with the proposed warning. I think though there needs to be a possible transition to move forward at some point (after giving users a transition period and then breaking them if they haven’t acted on it in a visible way). Otherwise we look us into the current state with no possibility to evolve in favor of stability.
I’m a bit stunned by this response. I tried to create this thread to talk about exactly this. Are you implying that you see nothing wrong with what happened? There are no improvements to be had?
To be clear, I’m not suggesting that colcon be shipped in the underlay, it could just be a namespaced binary package frozen on a major colcon version (I’ll ignore the fact that this change was introduced in the same major version ), ros-dashing-colcon or some such thing. The challenge there is having the ability to have both D’s colcon and E’s colcon co-installable.
I do indeed think it should be reverted, but of course I’m only part of “the community.”
I agree, but progress should never come at the expense of your users. I guess we’ll agree to disagree on that point. If you must continue with the current design and if you cannot maintain backwards compatibility for utilities used within stable releases, then can we please establish a deprecate-then-remove policy with a clear schedule?
Your suggestion seems to be very Debian-centric. How should that be done for other platforms (e.g. macOS and Windows) where users install the Python packages from PyPI? I doubt there is any precedence in the Python community for doing N parallel releases of the same package based on the major version.
Anyway the infrastructure we use to release Python packages doesn’t support such a workflow. If you would like to see a drastic change like this that would probably require a thorough design ticket / REP to motivate the change, describe how each and every piece in the pipeline should change as well as coming up with patches to realize that proposal.
Imo such a drastic change is neither needed nor warranted by this case. As mentioned already before if this problem would have been catched during review it would not have been merged and released in the first place. So the main area of improvement I see is that such changes should be reviewed more carefully - which also implies by more people which have to volunteer to do so.
On a side note: following semver when the major version if zero breaks can happen even on a minor version bump.
That being said: the change was introduced in a patch release. Again: this only happened because we were not aware of that significant regression. Otherwise we would have not only bumped the minor version but actually revised the PR before it got merged and release.
A tick-tock cycle is certainly desired.
One problem is that deprecations as they are done in Python are by default not visible. So commonly nobody notices them during the “tick” cycle and only gets hit when stuff breaks during “tock”.
colcon-core#232 introduces an option to change that. The goal would be to recommend to everyone (e.g. CI) who wants to catch deprecations / warnings early to set the flag so something other than the default. Similar to what the Python warnings module recommends to test framework developers 1.
For this specific case the PR colcon-ros#79 reverts the change and adds warnings in the case where colcon-ros installs the files implicitly for the package. Please feel free to provide feedback on that ticket (and its open questions).
Can you expand on how you would use this to fix the issue for macOS and Windows (or even non-deb based Linux distros)? I think it is easy to point out related features/technology, but difficult to apply it in practice. The “devil’s in the details” and all that.
I can imagine a few ways to use those features in our stack/release process but none that jump out at me as being strictly better.
I think it’s worth reiterating these two points as well: a) it was an unintentional break (we would have done it differently had we known impact), and b) colcon is following semver, but colcon is not >= 1.0.0 yet so the rules are much less strict. So in this particular case, I don’t think using anything but an exact version number constraint would have helped avoid the issue.
Fair criticism. Again, though, my point is less to debate the design and more to point out that the design on which you have settled has ramifications that I don’t currently feel are being respected.
Then why not immediately revert and go a different path? I see numerous pull requests to core projects fixing the fallout from this. There were numerousquestions about it. But not until I whined was anything done about it. In my opinion, if it was truly unintentional, it should have been reverted as soon as you noticed breakage.
I appreciate that things are happening now, thank you both for that. However, I’d like to see some lessons learned and changes made, or this type of thing can easily happen again, further harming ROS 2’s reputation. Defining a tick/tock cycle and sticking to it is a satisfactory change.
Unless I misunderstand (I am not following colcon development extremely closely right now), that is what’s happening. We’re unbreaking people and introducing a warning instead, as soon as possible.
Those were to master (will become Eloquent) which we knew needed to be fixed. Our mistake was failing to imagine what would happen (if anything) to Dashing. Normally we wouldn’t do that. Dirk even asked in a ROS 2 meeting with everyone there and no one thought to bring this up.
In fact, I think the idea was to make the change and use our nightly to find the problems and fix them, which in retrospect was reckless, but again we didn’t think about the fact that it affected all the distributions immediately. We were in our most commonly correct mindset of “the latest release only affects master”.
Sorry you and others were on the bleeding edge on this one, but I still haven’t heard a suggestion of what to change that we can act on. We already use tick-tock situationally between releases, and in the “extra-distro” python packages where it is possible.
We would have used that here, but we didn’t notice the issue until too late. I think the plan is to use it now, retroactively.
That’s exactly the ramification of the design choice that I was referring to above that I didn’t feel was respected. It’s an easy mistake to make (because you designed it that way), and I see nothing preventing it from happening again. Again, not trying to debate the design, just saying you gotta own it. We’re trying to convince customers that this is the choice for production robotics. “Oops, won’t do it again” doesn’t tend to convince customers that well.
Let’s say we created a repo of packages for each maintained ROS 2 distro. These packages are built and their tests are run each time colcon changes. Colcon releases are blocked on those tests being green. That way each time a breakage like this happens, we add a new package/test to ensure it doesn’t happen again.
Again, making such statements after the fact doesn’t help. Now in retrospect it is easy to state what could have been done differently (nobody argues about that). The numerous PRs were aiming to fix the problems - the scope just got bigger over time - and now we are moving forward with the revert.
The fact that we do have such an important PR to revert the change pending and it doesn’t get much attention in terms of feedback / review is one part of the problem. When it comes to the build tool or the build system in ROS 2 it often comes down to me doing all the work without others stepping up. I pretty frequently have to beg people to add their +1 in order to get PRs merged. If you ask me that fact is significantly contributing to this problem.
I don’t think that was ever the intention. A lot of our CI builds turned over just fine because these missing files are only an issue in very specific scenarios.
We do have numerous CI jobs already. Pretty much all of them can be run (and are being run) for proposed changes. Additionally nightly jobs ensure that the current state works. But as with any kind of automated tests not all cases are covered by those. In some use cases users ran into problem where they used rosdep on an installed underlay which lacked some package manifests and therefore didn’t install the necessary dependencies.
There is a limit to what you can do and cover by automated tests. I am not saying we shouldn’t try. But just stating “create a repo with packages and build and test it” isn’t going to cut it imo since it is completely unclear what the difference to the existing CI would be.
A possibility that is not yet being discussed is the use of virtual environments. Installing colcon and other pure python packages in a virtualenv would allow you to not pollute the system-level site packages.
Pardon my ignorance, but is it possible to freeze colcon releases for ros2 releases as a separate package instead to putting them in deb files that @kyrofa suggested (or was this what you were suggesting in your latest post)?
Look, obviously there’s a gap in the existing tests, which is natural, but we should be able to plug it. Is there already a suite of end-to-end tests that use Dashing + the new Colcon release? If so I’d happily add such a test there.
I just want a sensible retrospective here. “Colcon accidentally broke. We’re awfully sorry about that. In order to help ensure it doesn’t happen again, we _____”. Can we finish that sentence with something other than “hope”?
While that is certainly an option users would need to maintain separate venvs for each ROS distribution. Especially on Debian-based platforms that would mean not using Debian packages for colcon. That in itself posed a huge drawback since the venv doesn’t get updated by the package manager. Imo the overhead pushed onto each and every user significantly outweighs the isolation advantage.
Every existing CI job uses colcon to build a workspace. There are specific jobs for Dashing which by default use the released version of colcon but can be parameterized to use any branch. The problem is that these jobs might not fail for this specific case. Why? Because the tests often don’t require the presence of e.g. the package manifest. The use case described in Cannot locate rosdep definition in ros:nightly build · Issue #317 · osrf/docker_images · GitHub as an example where the missing manifests were a problem doesn’t exist in the CI jobs which only build a single workspace.
In the meantime the PR which had been pending for a while got reviewed, merged and released as colcon-ros 0.1.13 which should be available from Debian as well as PyPI for users to update.
It really is a shame that Debian doesn’t provide something like Gentoo’s versatile slots and eselect mechanism, but I think it would probably go against the nature of the platform.