python yaml ruamel.yaml shared-state

How do I avoid global state when using custom constructors in ruamel.yaml?


I am using ruamel.yaml to parse a complex YAML document where certain tagged nodes require special treatment. I inject my custom parsing logic using add_multi_constructor, as recommended by the published examples. The problem is that I need to change the injected logic dynamically depending on external state, but registration methods such as add_multi_constructor modify global state, which introduces unacceptable coupling between logically unrelated instances. Here is the MWE:

import ruamel.yaml

def get_loader(parameter):
    def construct_node(constructor: ruamel.yaml.Constructor, tag: str, node: ruamel.yaml.Node):
        return parameter(tag.lstrip("!"), str(node.value))

    loader = ruamel.yaml.YAML()
    loader.constructor.add_multi_constructor("", construct_node)
    return loader

foo = get_loader(lambda tag, node: f"foo: {tag}, {node}")
bar = get_loader(lambda tag, node: f"bar: {tag}, {node}")
print(foo.load("!abc 123"), bar.load("!xyz 456"), sep="\n")

Output:

bar: abc, 123
bar: xyz, 456

Expected:

foo: abc, 123
bar: xyz, 456

I came up with the following workaround, which creates a new class dynamically for each loader to break the coupling:

def get_loader(parameter):
    def construct_node(constructor: ruamel.yaml.Constructor, tag: str, node: ruamel.yaml.Node):
        return parameter(tag.lstrip("!"), str(node.value))

    # Create a new class to prevent state sharing through class attributes.
    class ConstructorWrapper(ruamel.yaml.constructor.RoundTripConstructor):
        pass

    loader = ruamel.yaml.YAML()
    loader.Constructor = ConstructorWrapper
    loader.constructor.add_multi_constructor("", construct_node)
    return loader

My questions are:


Solution

  • IMO you are not misusing the library, just working around its current shortcomings/incompleteness.

    Before ruamel.yaml got the API based on the YAML() instance, it had the function-based API of PyYAML with a few extensions, and other problems of PyYAML had to be worked around in a similarly unnatural way. E.g. I resorted to classes whose instances could be called (using __call__()), and whose methods could then be changed, just to get access to the YAML version parsed from a document (ruamel.yaml supports YAML 1.2 and 1.1, whereas PyYAML only (partially) supports 1.1).

    But underneath ruamel.yaml's YAML() instance not everything has improved. The code inherited from PyYAML stores the information for the various constructors in class attributes that serve as lookup tables (yaml_constructors and yaml_multi_constructors respectively), and ruamel.yaml still does that (the full old PyYAML-esque API is effectively still there, and only got a future deprecation warning with version 0.17).

    Your approach is interesting insofar as you do:

    loader.constructor.add_multi_constructor("", construct_node)
    

    instead of:

    loader.Constructor.add_multi_constructor("", construct_node)
    

    (you probably know that loader.constructor is a property that instantiates loader.Constructor if necessary, but other readers of this answer might not)

    or even:

    def get_loader(parameter):
        def construct_node(constructor: ruamel.yaml.Constructor, tag: str, node: ruamel.yaml.Node):
            return parameter(tag.lstrip("!"), str(node.value))
    
        # Create a new class to prevent state sharing through class attributes.
        class ConstructorWrapper(ruamel.yaml.constructor.RoundTripConstructor):
            pass
    
        ConstructorWrapper.add_multi_constructor("", construct_node)
    
        loader = ruamel.yaml.YAML()
        loader.Constructor = ConstructorWrapper
        return loader
    

    Your code works because the constructors are stored on a class attribute, as .add_multi_constructor() is a class method.
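
    To illustrate why a fresh subclass breaks the coupling, here is a toy model of the registration scheme inherited from PyYAML, as far as I understand it: the lookup table lives on the class, and the class method gives a class its own copy the first time that class registers something. ToyConstructor, ToySubclass and make_instance are made-up names for this sketch and not part of ruamel.yaml.

```python
class ToyConstructor:
    # Lookup table stored on the class, so it is shared by all instances.
    multi_constructors = {}

    @classmethod
    def add_multi_constructor(cls, tag_prefix, func):
        # Copy-on-first-write: a class without a table of its own gets a
        # private copy, so it does not write into its parent's table.
        if "multi_constructors" not in cls.__dict__:
            cls.multi_constructors = cls.multi_constructors.copy()
        cls.multi_constructors[tag_prefix] = func


def make_instance(label, isolate):
    # With isolate=True every call creates a new subclass (the workaround).
    klass = type("ToySubclass", (ToyConstructor,), {}) if isolate else ToyConstructor
    instance = klass()
    # Called on the instance, but a classmethod still receives the class.
    instance.add_multi_constructor("", lambda: label)
    return instance


shared_foo = make_instance("foo", isolate=False)
shared_bar = make_instance("bar", isolate=False)
print(shared_foo.multi_constructors[""](), shared_bar.multi_constructors[""]())
# prints: bar bar

isolated_foo = make_instance("foo", isolate=True)
isolated_bar = make_instance("bar", isolate=True)
print(isolated_foo.multi_constructors[""](), isolated_bar.multi_constructors[""]())
# prints: foo bar
```

    Without the subclass, the second registration overwrites the first in the one shared table; with it, each loader writes into the table of its own class.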

    So what you do is not entirely safe with respect to API breakage. ruamel.yaml is not at version 1.0 yet, and (API) changes that potentially break your code could come with any change of the minor version number. You should pin your version dependencies accordingly for your production code (e.g. ruamel.yaml<0.18), and raise that bound only after testing with a ruamel.yaml release that has a new minor version number.
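
    As a rough sketch of that advice, the check below compares a version string against the minor series you have tested. The helper names minor_series and is_tested_series and the default (0, 17) are assumptions for illustration; the pin itself would normally live in your requirements file, and the version string could be obtained with importlib.metadata.version("ruamel.yaml").

```python
def minor_series(version: str) -> tuple:
    """Return (major, minor) from a version string such as '0.17.21'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def is_tested_series(version: str, tested=(0, 17)) -> bool:
    """True if the version belongs to the minor series that was tested."""
    return minor_series(version) == tested


print(is_tested_series("0.17.21"))  # same minor series as the tested one
print(is_tested_series("0.18.0"))   # new minor series: test before raising the pin
```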


    It is possible to transparently change the use of the class attributes by turning the class methods add_constructor() and add_multi_constructor() into "normal" methods and doing the initialisation of the lookup tables in __init__(). Both of your examples that call the method on the instance:

    loader.constructor.add_multi_constructor("", construct_node)
    

    will then get the desired result, but ruamel.yaml's behaviour would not change when add_multi_constructor is called on the class using:

    loader.Constructor.add_multi_constructor("", construct_node)
    

    However, changing the class methods add_constructor() and add_multi_constructor() in this way affects all existing code that happens to call them on an instance instead of on the class (and is fine with the result).

    It is more likely that two new instance methods will be added, both to the Constructor class and to the YAML() instance, and that the class methods will either be phased out or be changed to check that a class, and not an instance, is passed in, after a deprecation period with warnings (as will the global functions add_constructor() and add_multi_constructor() inherited from PyYAML).

    The main advice, apart from pinning your production code to the minor version number, is to make sure your testing code displays PendingDeprecationWarning. If you are using pytest, this is the case by default. That should give you ample time to adapt your code to what the warning recommends.
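
    Outside pytest, one way to make these warnings visible (or fatal) during a test run is the standard warnings module. A minimal sketch; old_api is a hypothetical stand-in for a call that emits the warning, not a ruamel.yaml function.

```python
import warnings


def old_api():
    # Hypothetical stand-in for a library call scheduled for deprecation.
    warnings.warn("old_api() will be deprecated", PendingDeprecationWarning, stacklevel=2)
    return 42


# Python's default filters ignore PendingDeprecationWarning, so record it explicitly.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", PendingDeprecationWarning)
    result = old_api()

print(result, [w.category.__name__ for w in caught])

# In a test suite you may prefer to turn the warning into an error.
with warnings.catch_warnings():
    warnings.simplefilter("error", PendingDeprecationWarning)
    try:
        old_api()
        raised = False
    except PendingDeprecationWarning:
        raised = True

print(raised)
```

    Turning the warning into an error makes a test fail as soon as the deprecated path is exercised, which is harder to overlook than a line in the test log.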

    And if ruamel.yaml's author stops being lazy, he might provide some documentation for such API additions/changes.

    The following shows a constructor with per-instance lookup tables along those lines, exercised with the three ways of registering:

    import ruamel.yaml
    from ruamel.yaml.nodes import ScalarNode, SequenceNode, MappingNode
    import types
    import inspect
    
    
    class MyConstructor(ruamel.yaml.constructor.RoundTripConstructor):
        _cls_yaml_constructors = {}
        _cls_yaml_multi_constructors = {}
    
        def __init__(self, *args, **kw):
            self._yaml_constructors = {
                'tag:yaml.org,2002:null': self.__class__.construct_yaml_null,
                'tag:yaml.org,2002:bool': self.__class__.construct_yaml_bool,
                'tag:yaml.org,2002:int': self.__class__.construct_yaml_int,
                'tag:yaml.org,2002:float': self.__class__.construct_yaml_float,
                'tag:yaml.org,2002:binary': self.__class__.construct_yaml_binary,
                'tag:yaml.org,2002:timestamp': self.__class__.construct_yaml_timestamp,
                'tag:yaml.org,2002:omap': self.__class__.construct_yaml_omap,
                'tag:yaml.org,2002:pairs': self.__class__.construct_yaml_pairs,
                'tag:yaml.org,2002:set': self.__class__.construct_yaml_set,
                'tag:yaml.org,2002:str': self.__class__.construct_yaml_str,
                'tag:yaml.org,2002:seq': self.__class__.construct_yaml_seq,
                'tag:yaml.org,2002:map': self.__class__.construct_yaml_map,
                None: self.__class__.construct_undefined
            }
            self._yaml_constructors.update(self._cls_yaml_constructors)
            self._yaml_multi_constructors = self._cls_yaml_multi_constructors.copy()
            super().__init__(*args, **kw)
    
        def construct_non_recursive_object(self, node, tag=None):
            # type: (Any, Optional[str]) -> Any
            constructor = None  # type: Any
            tag_suffix = None
            if tag is None:
                tag = node.tag
            if tag in self._yaml_constructors:
                constructor = self._yaml_constructors[tag]
            else:
                for tag_prefix in self._yaml_multi_constructors:
                    if tag.startswith(tag_prefix):
                        tag_suffix = tag[len(tag_prefix) :]
                        constructor = self._yaml_multi_constructors[tag_prefix]
                        break
                else:
                    if None in self._yaml_multi_constructors:
                        tag_suffix = tag
                        constructor = self._yaml_multi_constructors[None]
                    elif None in self._yaml_constructors:
                        constructor = self._yaml_constructors[None]
                    elif isinstance(node, ScalarNode):
                        constructor = self.__class__.construct_scalar
                    elif isinstance(node, SequenceNode):
                        constructor = self.__class__.construct_sequence
                    elif isinstance(node, MappingNode):
                        constructor = self.__class__.construct_mapping
            if tag_suffix is None:
                data = constructor(self, node)
            else:
                data = constructor(self, tag_suffix, node)
            if isinstance(data, types.GeneratorType):
                generator = data
                data = next(generator)
                if self.deep_construct:
                    for _dummy in generator:
                        pass
                else:
                    self.state_generators.append(generator)
            return data
    
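        @staticmethod
        def get_args(*args, **kw):
            # Normalise a call to (target, tag, constructor). Two positional
            # arguments mean the call was made on a class (registration then
            # goes to MyConstructor itself); three mean it was made on an instance.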
            if kw:
                raise NotImplementedError('can currently only handle positional arguments')
            if len(args) == 2:
                return MyConstructor, args[0], args[1]
            else:
                return args[0], args[1], args[2]
    
        def add_constructor(*args, **kw):  # self, tag, constructor
            self, tag, constructor = MyConstructor.get_args(*args, **kw)
            if inspect.isclass(self):
                self._cls_yaml_constructors[tag] = constructor
                return
            self._yaml_constructors[tag] = constructor
    
        def add_multi_constructor(*args, **kw):  # self, tag_prefix, multi_constructor
            self, tag_prefix, multi_constructor = MyConstructor.get_args(*args, **kw)
            if inspect.isclass(self):
                self._cls_yaml_multi_constructors[tag_prefix] = multi_constructor
                return
            self._yaml_multi_constructors[tag_prefix] = multi_constructor
    
    def get_loader_org(parameter):
        def construct_node(constructor: ruamel.yaml.Constructor, tag: str, node: ruamel.yaml.Node):
            return parameter(tag.lstrip("!"), str(node.value))
    
        loader = ruamel.yaml.YAML()
        loader.Constructor = MyConstructor
        loader.constructor.add_multi_constructor("", construct_node)
        return loader
    
    foo = get_loader_org(lambda tag, node: f"foo: {tag}, {node}")
    bar = get_loader_org(lambda tag, node: f"bar: {tag}, {node}")
    print('>org<', foo.load("!abc 123"), bar.load("!xyz 456"), sep="\n")
    
    
    def get_loader_instance(parameter):
        def construct_node(constructor: ruamel.yaml.Constructor, tag: str, node: ruamel.yaml.Node):
            return parameter(tag.lstrip("!"), str(node.value))
    
        # Create a new class to prevent state sharing through class attributes.
        class ConstructorWrapper(MyConstructor):
            pass
    
        loader = ruamel.yaml.YAML()
        loader.Constructor = ConstructorWrapper
        loader.constructor.add_multi_constructor("", construct_node)
        return loader
    
    foo = get_loader_instance(lambda tag, node: f"foo: {tag}, {node}")
    bar = get_loader_instance(lambda tag, node: f"bar: {tag}, {node}")
    print('>instance<', foo.load("!abc 123"), bar.load("!xyz 456"), sep="\n")
    
    
    def get_loader_cls(parameter):
        def construct_node(constructor: ruamel.yaml.Constructor, tag: str, node: ruamel.yaml.Node):
            return parameter(tag.lstrip("!"), str(node.value))
    
        # Create a new class to prevent state sharing through class attributes.
        class ConstructorWrapper(MyConstructor):
            pass
    
        loader = ruamel.yaml.YAML()
        loader.Constructor = ConstructorWrapper
        loader.Constructor.add_multi_constructor("", construct_node)
        #      ^ using the virtual class method
        return loader
    
    foo = get_loader_cls(lambda tag, node: f"foo: {tag}, {node}")
    bar = get_loader_cls(lambda tag, node: f"bar: {tag}, {node}")
    print('>cls<', foo.load("!abc 123"), bar.load("!xyz 456"), sep="\n")
    

    which gives:

    >org<
    foo: abc, 123
    bar: xyz, 456
    >instance<
    foo: abc, 123
    bar: xyz, 456
    >cls<
    bar: abc, 123
    bar: xyz, 456
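
    In the >cls< case the registration ends up on MyConstructor itself, because get_args() hard-codes that class when only two arguments arrive, so the two subclasses still share state. If you want a call on a class to affect only that class, one alternative technique (not what ruamel.yaml does, just a sketch) is a small descriptor that passes either the instance or the actual class. The names class_or_instance_method, Registry and Sub are hypothetical.

```python
import functools


class class_or_instance_method:
    """Descriptor that passes the instance if there is one, else the class."""

    def __init__(self, func):
        self.func = func

    def __get__(self, instance, owner):
        target = owner if instance is None else instance
        return functools.partial(self.func, target)


class Registry:
    _cls_table = {}

    def __init__(self):
        # Each instance starts from a copy of the class-level table.
        self.table = dict(self._cls_table)

    @class_or_instance_method
    def register(target, key, value):
        if isinstance(target, type):
            # Called on a class: give that class its own table before writing.
            if "_cls_table" not in target.__dict__:
                target._cls_table = dict(target._cls_table)
            target._cls_table[key] = value
        else:
            # Called on an instance: only this instance is affected.
            target.table[key] = value


class Sub(Registry):
    pass


a, b = Registry(), Registry()
a.register("x", "only on a")
Sub.register("y", "only on Sub")
print(a.table, b.table, Registry().table, Sub().table)
# prints: {'x': 'only on a'} {} {} {'y': 'only on Sub'}
```

    Here registering on an instance leaves other instances untouched, and registering on Sub does not leak into Registry.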