PyYaml "include file" and yaml aliases (anchors/references)

I had a large YAML file with a massive use of YAML anchors and references, for example:

warehouse:
  obj1: &obj1
    key1: 1
    key2: 2
specific:
  spec1: 
    <<: *obj1
  spec2:
    <<: *obj1
    key1: 10

The file got too large, so I looked for a solution that would allow me to split it into two files, warehouse.yaml and specific.yaml, and to include warehouse.yaml inside specific.yaml. I read this simple article, which describes how to use PyYAML to achieve that, but it also says that the merge key (<<) is not supported.

Indeed, I got an error when I tried it that way:

yaml.composer.ComposerError: found undefined alias 'obj1'

So I started looking for an alternative, and I got confused because I don't know much about PyYAML.

Can I get the desired merge key support? Any other solutions for my problem?

Fingered answered 4/7, 2017 at 16:45 Comment(1)
I hope you are aware that the anchor for a value does not need to be the same string as the corresponding key (obj1). - Comparable

Crucial for the handling of anchors and aliases in PyYAML is the dict anchors that is part of the Composer. It maps anchors to nodes so that aliases can be looked up. Its existence is limited by the existence of the Composer, which is a composite element of the Loader that you use.

That Loader instance only exists for the duration of the call to yaml.load(), so there is no trivial way to extract this afterwards: first you would have to make the instance of the Loader() persist, and then make sure that the normal compose_document() method is not called (which, among other things, does self.anchors = {} to start clean for the next document in a single stream).
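
The reset can be observed directly. The following is a minimal sketch, assuming PyYAML is installed; it instantiates SafeLoader by hand (the names TEXT and anchors_after_load are illustrative only) and inspects the anchors dict once the document has been loaded:

```python
import yaml

TEXT = """\
obj1: &obj1
  key1: 1
  key2: 2
"""

# Instantiate the loader by hand so it can be inspected after loading;
# yaml.load() would create and dispose of it internally.
loader = yaml.SafeLoader(TEXT)
try:
    data = loader.get_single_data()
    # compose_document() has already reset the mapping at this point.
    anchors_after_load = dict(loader.anchors)
finally:
    loader.dispose()

print(data)
print(anchors_after_load)
```

The anchor obj1 was known while the document was being composed, but nothing of it is left on the loader afterwards.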

To further complicate things, if you have warehouse.yaml:

warehouse:
  obj1: &obj1
    key1: 1
    key2: 2

and specific.yaml:

warehouse: !include warehouse.yaml
specific:
  spec1:
    <<: *obj1
  spec2:
    <<: *obj1
    key1: 10

you would never get this to work with your snippet, even if you could preserve, extract and pass on the anchor information, because the composer handling specific.yaml encounters the undefined alias much earlier than the !include tag gets used for construction (and for filling anchors).
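
The ordering problem can be illustrated with a small sketch, assuming PyYAML. DemoLoader and record_include are hypothetical names; the constructor only records that it was called and never opens a file:

```python
import yaml
from yaml.composer import ComposerError

include_calls = []


class DemoLoader(yaml.SafeLoader):
    """Throwaway loader so the global SafeLoader is not modified."""


def record_include(loader, node):
    # Stand-in for a real include: only note that construction got this far.
    include_calls.append(node.value)
    return {}


DemoLoader.add_constructor("!include", record_include)

SPECIFIC = """\
warehouse: !include warehouse.yaml
specific:
  spec1:
    <<: *obj1
"""

try:
    yaml.load(SPECIFIC, Loader=DemoLoader)
    error = None
except ComposerError as exc:
    error = exc

print(error)
print(include_calls)
```

The whole document is composed into nodes before any constructor runs, so the alias lookup fails while include_calls is still empty.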

What you can do to circumvent this problem is to include specific.yaml

specific:
  spec1:
    <<: *obj1
  spec2:
    <<: *obj1
    key1: 10

from warehouse.yaml:

warehouse:
  obj1: &obj1
    key1: 1
    key2: 2
specific: !include specific.yaml

Alternatively, you can include both from a third file. Please note that the key specific is in both files.

With those two files run:

import sys
from ruamel import yaml

def my_compose_document(self):
    self.get_event()
    node = self.compose_node(None, None)
    self.get_event()
    # self.anchors = {}    # <<<< commented out
    return node

yaml.SafeLoader.compose_document = my_compose_document

# adapted from http://code.activestate.com/recipes/577613-yaml-include-support/
def yaml_include(loader, node):
    with open(node.value) as inputfile:
        return list(my_safe_load(inputfile, master=loader).values())[0]
#              leave out the [0] if your include file drops the key ^^^

yaml.add_constructor("!include", yaml_include, Loader=yaml.SafeLoader)


def my_safe_load(stream, Loader=yaml.SafeLoader, master=None):
    loader = Loader(stream)
    if master is not None:
        loader.anchors = master.anchors
    try:
        return loader.get_single_data()
    finally:
        loader.dispose()

with open('warehouse.yaml') as fp:
    data = my_safe_load(fp)
yaml.safe_dump(data, sys.stdout, default_flow_style=False)

which gives:


specific:
  spec1:
    key1: 1
    key2: 2
  spec2:
    key1: 10
    key2: 2
warehouse:
  obj1:
    key1: 1
    key2: 2

If your specific.yaml does not have the top-level key specific:


spec1:
  <<: *obj1
spec2:
  <<: *obj1
  key1: 10

then replace the last line of yaml_include() with:

return my_safe_load(inputfile, master=loader)

The above was done with ruamel.yaml (disclaimer: I am the author of that package) and tested on Python 2.7 and 3.6. By changing the import it will work with PyYAML as well.
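
For reference, a PyYAML-only variant could look roughly like the sketch below. It is an adaptation, not the code above verbatim: it subclasses SafeLoader instead of patching it, resolves the include path relative to the including file, and assumes the specific.yaml without the top-level key. The names IncludeLoader and load_with_anchors are hypothetical, and the two files are written to a temporary directory only to keep the example self-contained.

```python
import os
import tempfile

import yaml


class IncludeLoader(yaml.SafeLoader):
    """SafeLoader variant (hypothetical name) that keeps its anchors."""

    def compose_document(self):
        self.get_event()                      # drop DOCUMENT-START
        node = self.compose_node(None, None)
        self.get_event()                      # drop DOCUMENT-END
        return node                           # self.anchors is left untouched


def load_with_anchors(stream, master=None):
    loader = IncludeLoader(stream)
    if master is not None:
        loader.anchors = master.anchors       # share anchors with the including file
    try:
        return loader.get_single_data()
    finally:
        loader.dispose()


def yaml_include(loader, node):
    # Resolve the path relative to the including file (an addition of this sketch).
    base = os.path.dirname(loader.name)
    with open(os.path.join(base, node.value)) as inputfile:
        return load_with_anchors(inputfile, master=loader)


IncludeLoader.add_constructor("!include", yaml_include)

WAREHOUSE = """\
warehouse:
  obj1: &obj1
    key1: 1
    key2: 2
specific: !include specific.yaml
"""

SPECIFIC = """\
spec1:
  <<: *obj1
spec2:
  <<: *obj1
  key1: 10
"""

with tempfile.TemporaryDirectory() as tmp:
    for name, text in (("warehouse.yaml", WAREHOUSE), ("specific.yaml", SPECIFIC)):
        with open(os.path.join(tmp, name), "w") as fp:
            fp.write(text)
    with open(os.path.join(tmp, "warehouse.yaml")) as fp:
        data = load_with_anchors(fp)

print(yaml.safe_dump(data, default_flow_style=False))
```

Because the anchors dict is shared, an anchor defined again in an included file would trigger PyYAML's duplicate-anchor check, so anchor names have to be unique across all files loaded this way.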


With the new ruamel.yaml API the above can be much simplified, because the loader handed to the yaml_include() constructor knows about the YAML instance, but of course you still need an adapted compose_document that doesn't destroy anchors. Assuming the specific.yaml without top-level key specific, the following gives the same output as before.

import sys
from pathlib import Path
from ruamel.yaml import YAML

yaml = YAML(typ='safe', pure=True)
yaml.default_flow_style = False


def my_compose_document(self):
    self.parser.get_event()
    node = self.compose_node(None, None)
    self.parser.get_event()
    # self.anchors = {}    # <<<< commented out
    return node

yaml.Composer.compose_document = my_compose_document

# adapted from http://code.activestate.com/recipes/577613-yaml-include-support/
def yaml_include(loader, node):
    y = loader.loader
    yaml = YAML(typ=y.typ, pure=y.pure)  # same values as including YAML
    yaml.composer.anchors = loader.composer.anchors
    return yaml.load(Path(node.value))

yaml.Constructor.add_constructor("!include", yaml_include)

data = yaml.load(Path('warehouse.yaml'))
yaml.dump(data, sys.stdout)
Comparable answered 4/7, 2017 at 20:29 Comment(3)
As the comment on the ActiveState recipe indicates, that include mechanism is far from robust. One should at least subclass YAML and add a check against the files currently being processed, to prevent infinite recursion. Using typ='safe' means arbitrary objects cannot be instantiated, although abusing !include as is can still crash your program. - Comparable
This post is quite old. There is a more recent answer from @maciej about an extension for ruamel.yaml that adds the "!include" mechanism out of the box: ruamel.yaml.include. It is not maintained, though. Do you plan to add such a feature to your package in the future? - Myrticemyrtie
@Myrticemyrtie I was not aware of that answer, nor of the repository mentioned in there. I'll try to look into that when I have some time. - Comparable

It seems that someone has now solved this problem as an extension of ruamel.yaml.

pip install ruamel.yaml.include (source on GitHub)

To get the desired output above:

warehouse.yml

obj1: &obj1
  key1: 1
  key2: 2

specific.yml

specific:
  spec1: 
    <<: *obj1
  spec2:
    <<: *obj1
    key1: 10

Your code would be:

from ccorp.ruamel.yaml.include import YAML

yaml = YAML(typ='safe', pure=True)
yaml.allow_duplicate_keys = True

with open('specific.yml', 'r') as ymlfile:
    data = yaml.load(ymlfile)  # "return" is only valid inside a function

It also includes a handy !exclude function if you do not want the warehouse key in your output. If you only want the specific key, your specific.yml could begin with:

!exclude includes:
- !include warehouse.yml

In that case, your warehouse.yml could also include the top-level warehouse: key.

Fredenburg answered 3/5, 2019 at 18:31 Comment(1)
Myfile.yaml:

my_key1:
- *obj1
- name: my_obj1

pip install ccorp-yaml-include

from ccorp.ruamel.yaml.include import YAML
yaml = YAML(typ='safe', pure=True)
yaml.allow_duplicate_keys = True
p = "C:\\Users\\yaml\\Myfile.yaml"
f = open(p, 'r')
yaml.load(f)

ERROR: ruamel.yaml.composer.ComposerError: found undefined alias 'obj1'

- Humanism