
Loading and dumping multiple YAML files with ruamel.yaml (Python)


Using Python 2 (for now) and ruamel.yaml 0.13.14 (from RedHat EPEL).

I'm currently writing some code to load YAML definitions that are split across multiple files. The user-editable part contains, e.g.:

users:
  xxxx1:
    timestamp: '2018-10-22 11:38:28.541810'
    << : *userdefaults
  xxxx2:
    << : *userdefaults
    timestamp: '2018-10-22 11:38:28.541810'

The defaults are stored in another file, which is not editable:

userdefaults: &userdefaults
    # Default values for user settings
    fileCountQuota: 1000
    diskSizeQuota: "300g"

I can process these together by reading both files, concatenating the strings, and running the result through merged_data = list(yaml.load_all("{}\n{}".format(defaults_data, user_data), Loader=yaml.RoundTripLoader)), which correctly resolves everything. (Without RoundTripLoader I get errors that the references cannot be resolved, which is expected.)

Now I want to make some updates from Python code (e.g. update the timestamp), and for that I need to write back just the user part. That's where things get hairy: so far I haven't found a way to write only that YAML document rather than both.


Solution

  • First of all, unless there are multiple documents in your defaults file, you don't have to use load_all, because you are not concatenating two documents into a multiple-document stream. If you had, by using a format string with a document-end marker ("{}\n...\n{}") or with a directives-end marker ("{}\n---\n{}"), your aliases would not carry over from one document to the next, as per the YAML specification:

    It is an error for an alias node to use an anchor that does not previously occur in the document.

    The anchor has to be in the same document, not just in the same stream (which can consist of multiple documents).
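
To make the per-document scope of anchors concrete, here is a rough sketch that uses only the standard library. It is a crude text scan, not a YAML parser: the helper name `undefined_aliases` and both regular expressions are made up for this illustration, and quoted strings or block scalars containing `&` or `*` are not handled. It only shows why inserting `---` or `...` between the two files would break the `*userdefaults` alias.

```python
import re

# A line that starts a new document ('---') or ends one ('...').
BOUNDARY = re.compile(r'^(?:---|\.\.\.)(?:\s|$)')
# '&name' (anchor) or '*name' (alias) at line start or after whitespace/[/{/,
TOKEN = re.compile(r'(?<![^\s\[{,])([&*])([A-Za-z0-9_-]+)')


def undefined_aliases(stream):
    """Return (document_index, alias_name) pairs for aliases that have no
    earlier anchor in the same document (hypothetical helper, text scan only)."""
    problems = []
    doc_index = 0
    anchors = set()
    seen_content = False
    for line in stream.splitlines():
        if BOUNDARY.match(line):
            if seen_content:
                doc_index += 1
            anchors = set()  # anchors never survive a document boundary
            seen_content = False
            continue
        stripped = line.strip()
        if not stripped or stripped.startswith('#'):
            continue
        seen_content = True
        code = line.split(' #')[0]  # drop end-of-line comments (ignores quoting)
        for kind, name in TOKEN.findall(code):
            if kind == '&':
                anchors.add(name)
            elif name not in anchors:
                problems.append((doc_index, name))
    return problems


defaults_data = "userdefaults: &userdefaults\n    fileCountQuota: 1000\n"
user_data = "users:\n  xxxx1:\n    << : *userdefaults\n"

# one document: the alias finds its anchor
print(undefined_aliases("{}\n{}".format(defaults_data, user_data)))
# two documents: the alias in the second document has no anchor
print(undefined_aliases("{}\n---\n{}".format(defaults_data, user_data)))
```

With a plain newline between the files the scan reports nothing; with `---` or `...` in between, the alias ends up in a second document where the anchor is no longer known.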


    I tried some hocus pocus, pre-populating the dumper's dictionary of already represented nodes with the anchored node:

    import sys
    import datetime
    from ruamel import yaml
    
    def load():
        with open('defaults.yaml') as fp:
            defaults_data = fp.read()
        with open('user.yaml') as fp:
            user_data = fp.read()
        merged_data = yaml.load("{}\n{}".format(defaults_data, user_data), 
                                Loader=yaml.RoundTripLoader)
        return merged_data
    
    class MyRTDGen(object):
        """Callable that stands in for a Dumper class and passes on pre_populate."""

        class MyRTD(yaml.RoundTripDumper):
            def __init__(self, *args, **kw):
                # objects to register as "already represented" before dumping
                pps = kw.pop('pre_populate', None)
                yaml.RoundTripDumper.__init__(self, *args, **kw)
                if pps is not None:
                    for pp in pps:
                        try:
                            anchor = pp.yaml_anchor()
                        except AttributeError:
                            anchor = None
                        # empty placeholder mapping node carrying the anchor
                        node = yaml.nodes.MappingNode(
                            u'tag:yaml.org,2002:map', [], flow_style=None, anchor=anchor)
                        self.represented_objects[id(pp)] = node

        def __init__(self, pre_populate=None):
            assert isinstance(pre_populate, list)
            self._pre_populate = pre_populate

        def __call__(self, *args, **kw):
            # called by yaml.dump() where it would normally instantiate the Dumper
            kw1 = kw.copy()
            kw1['pre_populate'] = self._pre_populate
            myrtd = self.MyRTD(*args, **kw1)
            return myrtd
    
    
    def update(md, file_name):
        ud = md.pop('userdefaults')
        MyRTD = MyRTDGen([ud])
        yaml.dump(md, sys.stdout, Dumper=MyRTD)
        with open(file_name, 'w') as fp:
            yaml.dump(md, fp, Dumper=MyRTD)
    
    md = load()
    md['users']['xxxx2']['timestamp'] = str(datetime.datetime.utcnow())
    update(md, 'user.yaml')
    

    Since the PyYAML-based API requires a class instead of an object, you need a class generator that adds the data elements to pre-populate on the fly, from within yaml.dump().

    But this doesn't work, as a node only gets written out with an anchor once it is determined that the anchor is used (i.e. there is a second reference). So the first merge key actually gets written out as an anchor. And although I am quite familiar with the code base, I could not get this to work properly in a reasonable amount of time.

    So instead, I would rely on the fact that only one key at the root level of the dump of the combined, updated file matches the first key of user.yaml, and strip everything before that key.

    import sys
    import datetime
    from ruamel import yaml
    
    with open('defaults.yaml') as fp:
        defaults_data = fp.read()
    with open('user.yaml') as fp:
        user_data = fp.read()
    merged_data = yaml.load("{}\n{}".format(defaults_data, user_data), 
                            Loader=yaml.RoundTripLoader)
    
    # find the first root-level key of user.yaml (here: "users:")
    for line in user_data.splitlines():
        # strip end-of-line comments; '# ' inside quoted strings is not handled
        line = line.split('# ')[0].rstrip()
        if line and line[-1] == ':' and line[0] != ' ':
            split_key = line
            break
    
    merged_data['users']['xxxx2']['timestamp'] = str(datetime.datetime.utcnow())
    
    # dump the combined data, then keep only the part starting at the user key
    buf = yaml.compat.StringIO()
    yaml.dump(merged_data, buf, Dumper=yaml.RoundTripDumper)
    document = split_key + buf.getvalue().split('\n' + split_key)[1]
    sys.stdout.write(document)
    

    which gives:

    users:
      xxxx1:
        <<: *userdefaults
        timestamp: '2018-10-22 11:38:28.541810'
      xxxx2:
        <<: *userdefaults
        timestamp: '2018-10-23 09:59:13.829978'
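
The "find the key, then strip everything before it" step can also be written as two small functions that work on plain strings. This is only a sketch of a variant: the names `find_root_key` and `strip_before_key` are made up here, and instead of splitting on `'\n' + split_key` it compares whole lines, so it also copes with the key being the very first line of the dump. It has the same limitation as the loop above: a `# ` inside a quoted string is treated as a comment.

```python
def find_root_key(user_text):
    """Return the first root-level 'key:' line of user_text, or None."""
    for line in user_text.splitlines():
        # strip end-of-line comments; quoted strings are not checked
        line = line.split('# ')[0].rstrip()
        if line and line[-1] == ':' and line[0] != ' ':
            return line
    return None


def strip_before_key(dumped_text, split_key):
    """Return dumped_text starting at the root-level line equal to split_key."""
    lines = dumped_text.splitlines(True)  # keep the line endings
    for index, line in enumerate(lines):
        if line.rstrip() == split_key:
            return ''.join(lines[index:])
    raise ValueError('root key {!r} not found in dump'.format(split_key))


# stand-ins for the contents of user.yaml and for the combined dump
user_text = "# user settings\nusers:\n  xxxx1:\n    <<: *userdefaults\n"
dumped = (
    "userdefaults: &userdefaults\n"
    "  fileCountQuota: 1000\n"
    "  diskSizeQuota: 300g\n"
    "users:\n"
    "  xxxx1:\n"
    "    <<: *userdefaults\n"
)

document = strip_before_key(dumped, find_root_key(user_text))
print(document)
```

The resulting `document` string could then be written back to user.yaml with an ordinary `open(..., 'w')` and `write()`, which is the step the question asks for.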
    

    I had to make a virtualenv to make sure I could run the above with ruamel.yaml==0.13.14. That version is from the time I was still young (I won't claim to have been innocent). There have been over 85 releases of the library since then.

    I can understand that you might not be able to run anything but Python 2 at the moment and cannot compile or use a newer version. But what you really should do is install virtualenv (this can be done using EPEL, but also without further "polluting" your system installation), make a virtualenv for the code you are developing, and install the latest version of ruamel.yaml (and your other libraries) in there. You can also do that if you need to distribute your software to other systems: just install virtualenv there as well.

    I have all my utilities under /opt/util, managed by virtualenvutils, a wrapper around virtualenv.