
Serializing a RangeDict using YAML or JSON in Python


I am using RangeDict to make a dictionary that contains ranges. When I use Pickle it is easily written to a file and later read.

import pickle
from rangedict import RangeDict

rngdct = RangeDict()
rngdct[(1, 9)] = \
    {"Type": "A", "Series": "1"}
rngdct[(10, 19)] = \
    {"Type": "B", "Series": "1"}

with open('rangedict.pickle', 'wb') as f:
    pickle.dump(rngdct, f)

However, I want to use YAML (or JSON, if YAML won't work) instead of pickle, since many people seem to dislike pickle and I want human-readable files that make sense to anyone reading them.

Changing the code to call yaml and opening the file in 'w' mode instead of 'wb' does the trick on the writing side, but when I read the file in another script, I get these errors:

  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/yaml/constructor.py", line 129, in construct_mapping
    value = self.construct_object(value_node, deep=deep)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/yaml/constructor.py", line 61, in construct_object
    "found unconstructable recursive node", node.start_mark)
yaml.constructor.ConstructorError: found unconstructable recursive node

I'm lost here. How can I serialize the RangeDict object and read it back in its original form?


Solution

  • TL;DR; Skip to the bottom of this answer for working code


    I am sure some people hate pickle; it certainly can cause headaches when refactoring code (when the classes of pickled objects move to different files). But the bigger problem is that pickle is insecure, just as YAML is in the way that you used it.
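
    To make the security point concrete, here is a minimal sketch using only the standard library. The class name Innocent is hypothetical and exists only for this illustration; eval stands in for whatever callable an attacker would actually choose.

```python
import pickle


class Innocent:
    """Hypothetical class, used only for this illustration."""

    def __reduce__(self):
        # pickle stores "call eval('6 * 7')" instead of the object's state
        return (eval, ("6 * 7",))


payload = pickle.dumps(Innocent())

# loading runs the stored call; no Innocent instance is ever created
result = pickle.loads(payload)
print(result)
```

    Loading the payload runs the stored call rather than rebuilding an Innocent object, which is why pickle files (and YAML read with an unsafe loader) should only come from sources you trust.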

    It is interesting to note that you cannot pickle to the more readable protocol level 0 (the default in Python 3.0 through 3.7 is protocol version 3), as:

    pickle.dump(rngdct, f, protocol=0) will throw:

    TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled

    This is because the RangeDict module/class is a bit minimalistic, which also shows (or rather doesn't) if you try to do:

    print(rngdct)
    

    which will just print {}
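
    The protocol 0 failure mentioned above can be reproduced without RangeDict. The sketch below uses two hypothetical stand-in classes (Slotted and SlottedWithState are not part of the rangedict package) and assumes that the missing __getstate__/__setstate__ pair is the only obstacle:

```python
import pickle


class Slotted:
    """Hypothetical stand-in: defines __slots__ but no __getstate__."""
    __slots__ = ("low", "high")

    def __init__(self, low, high):
        self.low = low
        self.high = high


class SlottedWithState:
    """Same layout, plus explicit state handling for the old protocols."""
    __slots__ = ("low", "high")

    def __init__(self, low, high):
        self.low = low
        self.high = high

    def __getstate__(self):
        # hand pickle a plain tuple instead of a (missing) __dict__
        return (self.low, self.high)

    def __setstate__(self, state):
        self.low, self.high = state


try:
    pickle.dumps(Slotted(1, 9), protocol=0)
    error = None
except TypeError as exc:
    error = str(exc)
print(error)

# with the state methods in place, protocol 0 round-trips
restored = pickle.loads(pickle.dumps(SlottedWithState(1, 9), protocol=0))
print(restored.low, restored.high)
```

    Protocols 2 and up handle __slots__ on their own, which is why the default pickle.dump() call in the question works.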

    You probably used the PyYAML dump() routine (and its corresponding, unsafe, load()). Although that can dump generic Python classes, you have to realise that it was implemented before, or at roughly the same time as, Python 3.0 (Python 3 support was added later on). And although there is no reason a YAML parser could not dump and load the exact information that pickle does, PyYAML doesn't hook into the pickle support routines (although it could), and certainly not into the information for the Python 3 specific pickling protocols.

    Anyway, without a specific representer (and constructor) for RangeDict objects, using YAML doesn't really make any sense: it makes loading potentially unsafe, and your YAML includes all of the gory details that make the object efficient. If you do yaml.dump(), you get:

    !!python/object:rangedict.RangeDict
    _root: &id001 !!python/object/new:rangedict.Node
      state: !!python/tuple
      - null
      - color: 0
        left: null
        parent: null
        r: !!python/tuple [1, 9]
        right: !!python/object/new:rangedict.Node
          state: !!python/tuple
          - null
          - color: 1
            left: null
            parent: *id001
            r: !!python/tuple [10, 19]
            right: null
            value: {Series: '1', Type: B}
        value: {Series: '1', Type: A}
    

    Whereas, IMO, a readable representation in YAML would be:

    !rangedict
    [1, 9]:
      Type: A
      Series: '1'
    [10, 19]:
      Type: B
      Series: '1'
    

    Because of the sequences used as keys, this cannot be loaded by PyYAML without major modifications to the parser. But fortunately, those modifications have been incorporated in ruamel.yaml (disclaimer: I am the author of that package), so "all" you need to do is subclass RangeDict to provide suitable representer and constructor (class) methods:

    import io
    import ruamel.yaml
    from rangedict import RangeDict
    
    class MyRangeDict(RangeDict):
        yaml_tag = u'!rangedict'
    
        def _walk(self, cur):
            # walk tree left -> parent -> right
            if cur.left:
                for x in self._walk(cur.left):
                    yield x
            yield cur.r
            if cur.right:
                for x in self._walk(cur.right):
                    yield x
    
        @classmethod
        def to_yaml(cls, representer, node):
            d = ruamel.yaml.comments.CommentedMap()
            for x in node._walk(node._root):
                d[ruamel.yaml.comments.CommentedKeySeq(x)] = node[x[0]]
            return representer.represent_mapping(cls.yaml_tag, d)
    
        @classmethod
        def from_yaml(cls, constructor, node):
            d = cls()
            for x, y in node.value:
                x = constructor.construct_object(x, deep=True)
                y = constructor.construct_object(y, deep=True)
                d[x] = y
            return d
    
    
    rngdct = MyRangeDict()
    rngdct[(1, 9)] = \
        {"Type": "A", "Series": "1"}
    rngdct[(10, 19)] = \
        {"Type": "B", "Series": "1"}
    
    yaml = ruamel.yaml.YAML()
    yaml.register_class(MyRangeDict)  # tell the yaml instance about this class
    
    buf = io.StringIO()
    
    yaml.dump(rngdct, buf)
    data = yaml.load(buf.getvalue())
    
    # test for round-trip equivalence:
    for x in data._walk(data._root):
        for y in range(x[0], x[1]+1):
            assert data[y]['Type'] == rngdct[y]['Type']
            assert data[y]['Series'] == rngdct[y]['Series']
    

    The output of buf.getvalue() is exactly the readable representation shown before.

    If you have to deal with dumping RangeDict itself (i.e. cannot subclass because you use some library that has RangeDict hardcoded), then you can add the attribute and methods of MyRangeDict directly to RangeDict by grafting/monkeypatching.
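
    A minimal, standard-library-only sketch of such grafting follows. LibraryDict, Extras and graft() are hypothetical names chosen for illustration; LibraryDict stands in for a class you cannot subclass, and the grafted members are simplified placeholders rather than the real representer and constructor.

```python
class LibraryDict(dict):
    """Hypothetical stand-in for a class hardcoded in some library."""


class Extras:
    """Holds the attribute and methods that should be grafted on."""
    yaml_tag = "!librarydict"

    def sorted_keys(self):
        return sorted(self)

    @classmethod
    def describe(cls):
        return "{} dumps as {}".format(cls.__name__, cls.yaml_tag)


def graft(target, source, names):
    """Copy the named class attributes from source onto target."""
    for name in names:
        # vars() yields the raw class attribute, so a classmethod stays a
        # classmethod and binds to the target class when accessed there
        setattr(target, name, vars(source)[name])


graft(LibraryDict, Extras, ["yaml_tag", "sorted_keys", "describe"])

d = LibraryDict({3: "c", 1: "a"})
print(d.sorted_keys())
print(LibraryDict.describe())
```

    For the case in this answer, the names to graft would presumably be yaml_tag, _walk, to_yaml and from_yaml, after which RangeDict itself could be passed to yaml.register_class().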