suggestion/discussion: drop/contribute to libyaml&pyyaml and default to YAML1.2

First time posting here, please guide me if this is in the wrong category.

The problem

YAML1.2 was published in 2009 (14 years ago!), and widely-used parsers such as yaml-cpp have adopted its syntax.
Yet ironically, the “official” implementations still aren’t 100% compliant. (libyaml, pyyaml, yaml 1.2 tracker for both)

Why should we be concerned?

Yaml 1.1 → 1.2 had some breaking changes, and more and more libraries are defaulting to 1.2 - after all, it’s been 14 years since release.
The following example is a comparison of what will be parsed in python, using pyyaml (yaml1.1) and ruamel.yaml (yaml1.2):

>>> import yaml
>>> import ruamel.yaml as ryaml
>>> yaml.safe_load("{y: 1, yes: 2, on: 3, true: 4}")
{'y': 1, True: 4}
>>> ryaml.safe_load("{y: 1, yes: 2, on: 3, true: 4}")
{'y': 1, 'yes': 2, 'on': 3, True: 4}

>>> yaml.safe_load("{octal_1_1: 010, octal_1_2: 0o10}")
{'octal_1_1': 8, 'octal_1_2': '0o10'}
>>> ryaml.safe_load("{octal_1_1: 010, octal_1_2: 0o10}")
{'octal_1_1': 10, 'octal_1_2': 8}

>>> yaml.safe_load('0b1')
1
>>> ryaml.safe_load('0b1')
1  # this is a bug, this should be a string, or invalid

As you can see, the parsed output is quite different. While it’s not yet a problem, we should keep in mind that this might cause some trouble in the future.
(We can avoid this using linters, but new users will expect that we’re using the “latest” spec - 1.2)

So what next?

I want to have a discussion on how do we handle this potential problem.
I currently have 2 ideas:

  1. Call to arms: dedicate some time to contribute to libyaml & pyyaml
    It will be slow, but the “preferred open source approach”
  2. Drop libyaml&pyyaml, and use replacements
    libfyaml has been suggested in libyaml’s YAML1.2 thread, although I haven’t personally tested.
    ruamel.yaml is known to be a nearly drop-in replacement against pyyaml, that supports YAML1.2.

Edit: more examples, and a link to yaml’s changelog

Wow, that’s a surprising difference between the two parses there. Could you outline why this is the behavior the yaml library is taking? I can replicate it, but I don’t fully understand why it’s not displaying the values for the yes and on keys. Something to do with ‘yes’, ‘on’, and ‘true’ all parsing to ‘True’?

Do we have a sense for what the required effort would be to update pyyaml and libyaml, especially relative to the effort required to shift the ROS libraries? Similarly, what is the potential impact (e.g., do we think this is a necessary shift)?

Sure, basically up until yaml1.1, unquoted “yes” and “on” were treated as boolean true.
Here’s the full changelog if you’re interested: yaml 1.2.2 changes

AFAIK, pyyaml has an pr that adds 1.2 support that the maintainer said will be merged in November. Although I haven’t tested it myself, so I can’t say if it covers the full 1.2 spec.
As for libyaml, I have no idea.

I’m pretty sure fixing pyyaml&libyaml is easier - only 2 compared to hundreds or thousands of packages.
The problem is, their pr approval process seem to be quite slow.

E.g. even if pyyaml’s pr gets merged in November as the maintainer said, it’ll be 2+ years since the last modification to the pulled branch.

I personally think it’s not a large enough problem (yet) to move away from libyaml&pyyaml - but that’s just me. I’m hoping to get feedback from more people.

1 Like

Personally, I think that it is not yet a problem. At least, besides this thread I haven’t seen complaints about adding YAML 1.2 support. So my hope is that PyYAML and libyaml will gain support before it becomes a problem.

Nevertheless, I took a quick look at the situation. First of all, we can only possibly do a transition if the relevant packages are available in the underlying distribution. For Ubuntu Jammy, it looks like ruamel and libfyaml are both packaged, so that is good. It also looks like RHEL-9 has ruamel packaged, but not libfyaml. So we’d have to do some work in EPEL to get libfyaml packaged. And then we’d have to do something about Windows (ruamel can probably come from pip, but we’d have to build or vendor libfyaml).

Second, we need to have rosdep keys for the dependencies. ruamel alerady has one, but we’d have to add libfyaml (this is easy to do, but just a step that needs to happen).

Third, we’d have to update https://ci.ros2.org to add in the dependencies.

Fourth, we would need to transition the code. The difficulty level of this is going to be determined by how similar the APIs are and how many packages depend on them. For instance, looking at all currently released Rolling packages, it looks like there are ~18 packages that depend on PyYAML, and about half of those are in the core. Depending on how close the ruamel API is to the PyYAML API, that could be a relatively straightforward transition. The situation with libyaml is even easier. Looking at currently released Rolling packages, the only one that depends on libyaml in any way is rcl and rcl_yaml_parser, both of which are in the core.

And finally, we’d have to update the documentation to point to the new packages (mostly in the Windows case, but we may have other things lying around that reference libyaml).

All in all, it is quite a bit of work to make the change. It’s all very possible, but I think it will be much easier to just add that support to PyYAML and libyaml if we can.

1 Like

Do you think these counts can be trusted? In my experience, people don’t know yaml is not a part of the core Python library, so they very often omit to list the dependency. And as it is used so widely, it happens most of the time that it gets pulled transitively, so nobody notices…

1 Like

That’s fair. I instead did a search for import yaml and from yaml import, and you are right, there are more instances, about 40 packages that use it. That’s still not insurmountable, though it would be a lot of work to transition.

1 Like

Certainly this is something that is a bit frustratingly slow on the uptake. However as mentioned above there haven’t been a lot of obvious complaints about it. Certainly yaml 1.2 is an improvement. However I think that patience here is a virtue.

If this was an important blocker for something it might make sense to push forward and make a concerted effort to change. However, this will require us to diverge from the official and much more popular default yaml parser. Which means that we’re making the decision to change away from the default for potentially all of our users.

As a quick comparision in Debian, pyyaml has an order of magnitude larger user base

Pyyaml on popcon


https://qa.debian.org/popcon.php?package=pyyaml

ruamel.yaml on popcon


https://qa.debian.org/popcon.php?package=ruamel.yaml

And with that larger user base typically comes more stability, sometimes slower uptake on new features, but also it is what people expect.

If we were to make a case for switching to ruamel.yaml or another yaml parser by default I would want to see an argument that is better and will be better in the long run, with better maintenance and the expectation that it will be better in the long run. And that we would never want to switch back to pyyaml because it’s fundamentally not the right choice for us.

Instead when I look at this I see that it feels more like a fork of pyyaml which has pushed forward to make yaml 1.2 the default and has done some other minor changes. But in general it’s mostly a feature fork. It also in completely possible to use in parallel and users can choose to use it themselves in their own code. If yaml 1.2 was not targeted for integration into pyyaml.

Part of being a distribution as well as being part of distributions is that there’s a large community process by which things update over time. And staying with the default is the way to take advantage of that choice in the long run. I don’t think that the benefits of making an effort to be an early adopter will pay off.

I haven’t had a chance to look at the cpp side of things, but I would want to see a similar path of logic to believe that we should make a switch away from a default/official implementation.

4 Likes

I know this might sound silly. What about actually running both parsers and issuing a warning if it parses differently? This could bring the best of both worlds: keeping compatibility with the default parser, while warning the users about deprecated/dangerous yaml constructs…