
PyYaml "include file" and yaml aliases (anchors/references)


I had a large YAML file that made heavy use of YAML anchors and references, for example:

warehouse:
  obj1: &obj1
    key1: 1
    key2: 2
specific:
  spec1: 
    <<: *obj1
  spec2:
    <<: *obj1
    key1: 10

The file got too large, so I looked for a solution that would allow me to split it into 2 files, warehouse.yaml and specific.yaml, and to include warehouse.yaml inside specific.yaml. I read this simple article, which describes how I can use PyYAML to achieve that, but it also says that the merge key (<<) is not supported.

Indeed, I got an error:

yaml.composer.ComposerError: found undefined alias 'obj1'

when I tried it that way.

So I started looking for an alternative way, and I got confused because I don't really know much about PyYAML.

Can I get the desired merge key support? Any other solutions for my problem?


Solution

  • Crucial for the handling of anchors and aliases in PyYAML is the dict anchors that is part of the Composer. It maps anchors to nodes so that aliases can be looked up. Its existence is limited by the existence of the Composer, which is a composite element of the Loader that you use.

    That Loader class only exists during the time of the call to yaml.load() so there is no trivial way to extract this afterwards: first you would have to make the instance of the Loader() persist and then make sure that the normal compose_document() method is not called (which among other things does self.anchors = {}, to be clean for the next document (in a single stream)).
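    As a minimal sketch of this behaviour (assuming PyYAML is installed; PeekLoader and seen_anchors are hypothetical names introduced only for this illustration), the following subclass snapshots the anchors dict just before compose_document() clears it:

```python
import yaml  # PyYAML

DOC = """\
warehouse:
  obj1: &obj1
    key1: 1
    key2: 2
"""


class PeekLoader(yaml.SafeLoader):
    """Hypothetical loader that records the anchors before they are reset."""

    seen_anchors = None

    def compose_document(self):
        self.get_event()                          # drop DOCUMENT-START
        node = self.compose_node(None, None)      # compose the root node
        self.get_event()                          # drop DOCUMENT-END
        self.seen_anchors = dict(self.anchors)    # snapshot, for illustration
        self.anchors = {}                         # the reset the stock method performs
        return node


loader = PeekLoader(DOC)
try:
    data = loader.get_single_data()
    anchors_during = sorted(loader.seen_anchors)  # anchor names known while composing
    anchors_after = dict(loader.anchors)          # what is left once loading is done
finally:
    loader.dispose()

print(anchors_during)
print(anchors_after)
```

    The snapshot holds the anchor name, while the loader's own dict is empty again once the document has been composed.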

    To complicate things further, if you had warehouse.yaml:

    warehouse:
      obj1: &obj1
        key1: 1
        key2: 2
    

    and specific.yaml:

    warehouse: !include warehouse.yaml
    specific:
      spec1:
        <<: *obj1
      spec2:
        <<: *obj1
        key1: 10
    

    you would never get this to work with your snippet, even if you could preserve, extract and pass on the anchor information, because the composer handling specific.yaml encounters the undefined alias long before the !include tag is used for construction (and for filling anchors).
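    The ordering problem can be observed directly. The sketch below assumes PyYAML; IncludeLoader, fake_include and include_calls are hypothetical names used only here. It registers a dummy !include constructor that merely records whether it was called, then tries to load the text of specific.yaml:

```python
import yaml  # PyYAML

SPECIFIC = """\
warehouse: !include warehouse.yaml
specific:
  spec1:
    <<: *obj1
"""

include_calls = []


class IncludeLoader(yaml.SafeLoader):
    """Hypothetical loader subclass, so the global SafeLoader stays untouched."""


def fake_include(loader, node):
    # Only records the call; a real constructor would open and load the file.
    include_calls.append(node.value)
    return {}


IncludeLoader.add_constructor("!include", fake_include)

try:
    yaml.load(SPECIFIC, Loader=IncludeLoader)
    error = None
except yaml.composer.ComposerError as exc:
    error = exc

print(error.problem if error else "no error")
print(include_calls)
```

    The ComposerError is raised while the node tree is being composed, so the constructor is never reached and include_calls stays empty.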

    What you can do to circumvent this problem is to include specific.yaml

    specific:
      spec1:
        <<: *obj1
      spec2:
        <<: *obj1
        key1: 10
    

    from warehouse.yaml:

    warehouse:
      obj1: &obj1
        key1: 1
        key2: 2
    specific: !include specific.yaml
    

    Alternatively, include both in a third file. Please note that the key specific is in both files.

    With those two files run:

    import sys
    from ruamel import yaml
    
    def my_compose_document(self):
        self.get_event()
        node = self.compose_node(None, None)
        self.get_event()
        # self.anchors = {}    # <<<< commented out
        return node
    
    yaml.SafeLoader.compose_document = my_compose_document
    
    # adapted from http://code.activestate.com/recipes/577613-yaml-include-support/
    def yaml_include(loader, node):
        with open(node.value) as inputfile:
            return list(my_safe_load(inputfile, master=loader).values())[0]
    #              leave out the [0] if your include file drops the key ^^^
    
    yaml.add_constructor("!include", yaml_include, Loader=yaml.SafeLoader)
    
    
    def my_safe_load(stream, Loader=yaml.SafeLoader, master=None):
        loader = Loader(stream)
        if master is not None:
            loader.anchors = master.anchors
        try:
            return loader.get_single_data()
        finally:
            loader.dispose()
    
    with open('warehouse.yaml') as fp:
        data = my_safe_load(fp)
    yaml.safe_dump(data, sys.stdout, default_flow_style=False)
    

    which gives:

    
    specific:
      spec1:
        key1: 1
        key2: 2
      spec2:
        key1: 10
        key2: 2
    warehouse:
      obj1:
        key1: 1
        key2: 2
    

    If your specific.yaml did not have the top-level key specific:

    
    spec1:
      <<: *obj1
    spec2:
      <<: *obj1
      key1: 10
    

    then replace the last line of yaml_include() with:

    return my_safe_load(inputfile, master=loader)
    

    The above was done with ruamel.yaml (disclaimer: I am the author of that package) and tested on Python 2.7 and 3.6. By changing the import it will work with PyYAML as well.
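    For PyYAML, a self-contained variant might look like the sketch below. It is not the code above verbatim: instead of patching yaml.SafeLoader globally it uses a subclass, it writes the two example files to a temporary directory so it can be run as-is, and it assumes the specific.yaml without the top-level key. SharedAnchorLoader, base_dir, include and load_with_includes are hypothetical names.

```python
import os
import tempfile

import yaml  # PyYAML

WAREHOUSE = """\
warehouse:
  obj1: &obj1
    key1: 1
    key2: 2
specific: !include specific.yaml
"""

SPECIFIC = """\
spec1:
  <<: *obj1
spec2:
  <<: *obj1
  key1: 10
"""


class SharedAnchorLoader(yaml.SafeLoader):
    """Hypothetical SafeLoader subclass that keeps its anchors after composing."""

    base_dir = "."

    def compose_document(self):
        self.get_event()                      # drop DOCUMENT-START
        node = self.compose_node(None, None)  # compose the root node
        self.get_event()                      # drop DOCUMENT-END
        return node                           # note: self.anchors is NOT reset


def include(loader, node):
    # Resolve the included path relative to the including file (an assumption).
    path = os.path.join(loader.base_dir, loader.construct_scalar(node))
    with open(path) as stream:
        child = SharedAnchorLoader(stream)
        child.base_dir = loader.base_dir
        child.anchors = loader.anchors        # share the parent's anchors
        try:
            return child.get_single_data()
        finally:
            child.dispose()


SharedAnchorLoader.add_constructor("!include", include)


def load_with_includes(path):
    with open(path) as stream:
        loader = SharedAnchorLoader(stream)
        loader.base_dir = os.path.dirname(os.path.abspath(path))
        try:
            return loader.get_single_data()
        finally:
            loader.dispose()


with tempfile.TemporaryDirectory() as tmp:
    for name, text in (("warehouse.yaml", WAREHOUSE), ("specific.yaml", SPECIFIC)):
        with open(os.path.join(tmp, name), "w") as fh:
            fh.write(text)
    data = load_with_includes(os.path.join(tmp, "warehouse.yaml"))

print(yaml.safe_dump(data, default_flow_style=False))
```

    Using a subclass keeps the changed compose_document() away from other code in the same process that relies on the stock yaml.SafeLoader.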


    With the new ruamel.yaml API the above can be simplified considerably, because the loader handed to the yaml_include() constructor knows about the YAML instance. You still need an adapted compose_document that doesn't destroy the anchors, though. Assuming the specific.yaml without the top-level key specific, the following gives the same output as before.

    import sys
    from ruamel.std.pathlib import Path
    from ruamel.yaml import YAML
    
    yaml = YAML(typ='safe', pure=True)
    yaml.default_flow_style = False
    
    
    def my_compose_document(self):
        self.parser.get_event()
        node = self.compose_node(None, None)
        self.parser.get_event()
        # self.anchors = {}    # <<<< commented out
        return node
    
    yaml.Composer.compose_document = my_compose_document
    
    # adapted from http://code.activestate.com/recipes/577613-yaml-include-support/
    def yaml_include(loader, node):
        y = loader.loader
        yaml = YAML(typ=y.typ, pure=y.pure)  # same values as including YAML
        yaml.composer.anchors = loader.composer.anchors
        return yaml.load(Path(node.value))
    
    yaml.Constructor.add_constructor("!include", yaml_include)
    
    data = yaml.load(Path('warehouse.yaml'))
    yaml.dump(data, sys.stdout)