Roundtrip dump expands the merge aliases on encountering duplicate keys with merge

import sys
import ruamel.yaml
yaml_str = """\
hello: world
foo: &core_foo
    s: 1
"""
yaml_str2 = """\
hello1 : world
foo:
    <<: *core_foo
"""
yaml = ruamel.yaml.YAML()
yaml.allow_duplicate_keys = True
yaml.dump(data, sys.stdout)

data = yaml.load(yaml_str + yaml_str2)

I tried to concatenate the two strings and load them with duplicate keys allowed. While the result of the load is what I expected, the dump preserves neither the merge nor the aliases.

Expected:

hello: world
foo:
  <<: *core_foo

hello1: world

Actual:

hello: world
foo:
  s: 1
hello1: world

Is this the expected behaviour?

Solution

First of all, it is unlikely that your program generates the output you show, because you set data by loading the concatenated strings after you dump it. I am also not sure why you concatenate the strings, but that might be a remnant from experimenting with the code.

The behaviour is as expected. When allowing duplicate keys, ruamel.yaml drops any recurring instances. Some other parsers don't check for duplicate keys and silently overwrite the original entry (but by then they will have resolved the alias, so the merged mapping data will probably be there). In ruamel.yaml the key-value pair foo and the "merge", although they get parsed, are then dropped. This leaves the value for the first key foo with an anchor, but with only one reference to that value. The id (core_foo) is still attached to the data structure (as can be seen from the output of the code below).

During dump, ruamel.yaml tracks the nodes that are going to be dumped, and if the same (Python) id is encountered again, the first occurrence gets an anchor and any following one an alias. So essentially you cannot dump any node until you know whether it needs an anchor (i.e. you essentially walk over the data structure twice). Since the second occurrence of foo gets discarded, there is no second reference to the data structure, and the initial occurrence never needs an anchor. You can easily check that behaviour by changing foo in your yaml_str2 to a key that doesn't occur in that mapping.
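The two-pass idea can be illustrated with a small standard-library sketch. This is not ruamel.yaml's actual implementation; the function names count_references and needs_anchor are hypothetical and only mimic the principle of counting object ids before deciding which nodes get an anchor:

```python
def count_references(node, counts=None):
    """First pass: count how often each container (by id) is reached."""
    if counts is None:
        counts = {}
    if isinstance(node, (dict, list)):
        counts[id(node)] = counts.get(id(node), 0) + 1
        if counts[id(node)] > 1:
            return counts  # seen before, do not descend again
        children = node.values() if isinstance(node, dict) else node
        for child in children:
            count_references(child, counts)
    return counts


def needs_anchor(node, counts):
    """Second-pass helper: only nodes reached more than once need an anchor."""
    return counts.get(id(node), 0) > 1


core = {'s': 1}
shared = {'foo': core, 'bar': core}  # two references to the same object
single = {'foo': core}               # the second reference was dropped

print(needs_anchor(core, count_references(shared)))  # True
print(needs_anchor(core, count_references(single)))  # False
```

With only one surviving reference, as in the duplicate-key case, the count stays at one and no anchor would be emitted.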

It is however possible to force dump a loaded anchor by setting its always_dump attribute. There is no global option on the YAML() instance to do that, so you either need to know where the anchor is located or recursively walk the data structure:

import sys
import ruamel.yaml

yaml_str = """\
hello: world
foo: &core_foo
    s: 1
hello1 : world
foo:
    <<: *core_foo
"""
yaml = ruamel.yaml.YAML()
yaml.allow_duplicate_keys = True
data = yaml.load(yaml_str)
print(data['foo'].anchor)
print('=' * 10)
yaml.dump(data, sys.stdout)
print('=' * 10)

def always_dump_anchors(d):
    if isinstance(d, dict):
        for k, v in d.items():
            always_dump_anchors(k)
            always_dump_anchors(v)
    elif isinstance(d, list):
        for elem in d:
            always_dump_anchors(elem)
    if hasattr(d, 'anchor'):
        d.anchor.always_dump = True

always_dump_anchors(data)
yaml.dump(data, sys.stdout)

which gives:

Anchor('core_foo')
==========
hello: world
foo:
  s: 1
hello1: world
==========
hello: world
foo: &core_foo
  s: 1
hello1: world

Keeping track of ids is necessary in any kind of data structure representation that might be self-referencing. Since that takes time, some "dumpers", like the json package in the standard library, allow you to speed things up by specifying that your data structure is not self-referencing (json.dump does this via its check_circular=False argument). Even your average __repr__ should do this, as became clear when ordereddict was originally added to Python 2: it would crash on self-referential structures (and that although the author of that change was aware of a test suite for ordereddict implementations that included tests for this).
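As a small illustration using only the standard library (the variable names are arbitrary), the following shows the json check and how the built-in containers handle a cycle in their repr:

```python
import json

plain = {'hello': 'world', 'foo': {'s': 1}}
# Skipping the check is only safe because we know `plain` has no cycles.
fast = json.dumps(plain, check_circular=False)

cyclic = {}
cyclic['self'] = cyclic  # the dict now contains itself
try:
    json.dumps(cyclic)   # default is check_circular=True
    detected = False
except ValueError:
    detected = True      # the cycle is reported instead of recursing endlessly

own_repr = repr(cyclic)  # built-in containers mark the cycle with '...'

print(fast)      # {"hello": "world", "foo": {"s": 1}}
print(detected)  # True
print(own_repr)  # {'self': {...}}
```

Do not pass check_circular=False for data that might contain a cycle: the encoder then has nothing to stop it from recursing.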