Tags: json, git, serialization, yaml, tree-structure

Store json-like hierarchical data as nested directory tree?


TLDR

I am looking for an existing convention to encode / serialize tree-like data in a directory structure, split into small files instead of one big file.

Background

There are different scenarios where we want to store tree-like data in a file, which can then be tracked in git. JSON files can express dependencies for a package manager (e.g. Composer for PHP, npm for Node.js). YAML files can define routes, test cases, etc.

Typically a "tree structure" is a combination of key-value lists and "serial" lists, where each value can again be a tree structure.

Very often the order of associative keys is irrelevant, and should ideally be normalized to alphabetic order.

One problem with storing a big tree structure in a single file (be it JSON or YAML) that is tracked with git is that you get plenty of merge conflicts when different branches add and remove entries in the same key-value list.

Especially for key-value lists where the order is irrelevant, it would be more git-friendly to store each sub-tree in a separate file or directory, instead of storing them all in one big file.

Technically it should be possible to create a directory structure that is as expressive as JSON or YAML.

Performance concerns can be overcome with caching. If the files are going to be tracked in git, we can assume they are going to be unchanged most of the time.

The main challenges:

- How to deal with "special characters" that cause problems in some or most file systems, if used in a file or directory name?
- If I need to encode or disambiguate special characters, how can I still keep it pleasant to the eye?
- How to deal with limitations on file name length in some file systems?
- How to deal with other file system quirks, e.g. case insensitivity? Is this even still a thing?
- How to express serial lists, which might contain key-value lists as children? Serial lists cannot be expressed as directories, so their children have to live within the same file.
- How can I avoid reinventing the wheel, i.e. creating my own made-up "convention" that nobody else uses?
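For the first two challenges, one conceivable (made-up, not an established convention) scheme is percent-encoding: keep the characters that are both file-system-safe and readable, and escape everything else. A minimal sketch:

```python
from urllib.parse import quote, unquote

def key_to_filename(key):
    """Map an arbitrary key to a file-system-safe name via percent-encoding.

    Illustrative scheme, not an established convention: letters, digits and
    a few readable punctuation characters stay as-is; everything else
    (slashes, colons, control characters, ...) becomes %XX escapes.
    Caveat: on case-insensitive file systems, "Key" and "key" would still
    collide; a real scheme would also need to escape or disambiguate case.
    """
    return quote(key, safe=" .,-()")

def filename_to_key(name):
    """Inverse mapping: decode the percent escapes back into the key."""
    return unquote(name)
```

The trade-off is visible here: the more characters you escape, the safer and more portable the names, but the less pleasant they are to read.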

Desired features:

- As expressive as JSON or YAML.
- Git-friendly.
- Machine-readable and -writable.
- Human-readable and -editable, perhaps with limitations.
- Ideally it should use known formats (JSON, YAML) for structures and values that are expressed within a single file.

Naive approach

Of course the first idea would be to use YAML files for literal values and serial lists, and directories for key-value lists (in cases where the order does not matter). In a key-value list, the file and directory names are interpreted as keys, and the files and subdirectories as values.
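As an illustration, a key-value tree with the keys `version` and `routes` (a made-up example) could be laid out like this under the naive approach:

```
config/
  version.yml     # key "version"; the file content is the value
  routes/         # key "routes"; a nested key-value list
    home.yml      # key "home"; content: a literal value or serial list
    api.yml       # key "api"
```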

This has some limitations, because not every possible key that would be valid in json or yml is also a valid file name in every file system. The most obvious example would be a slash.

Question

I have different ideas how I would do this myself.

But I am really looking for some kind of convention for this that already exists.

Related questions

Persistence: Data Trees stored as Directory Trees
This is asking about performance, and about using the file system like a database, I think.
I am less interested in performance (caching makes it irrelevant) and more interested in the actual storage format / convention.


Solution

  • The closest thing I can think of that could be seen as some kind of convention for this is Linux configuration files. In modern Linux, you often split the configuration of a service into multiple files residing in a certain directory, e.g. /etc/exim4/conf.d/ instead of having a single file /etc/exim4/exim4.conf. There are multiple reasons for doing this.

    We can learn a bit from this: separation into distinct files should happen when the semantics of the content are orthogonal, i.e. the semantics of one file do not depend on the semantics of another file. This is of course a rule for sibling files; we cannot really deduce rules for serializing a tree structure as a directory tree from it. However, we can definitely see reasons for not splitting every value into its own file.

    You mention problems with encoding special characters into a file name. You will only have this problem if you go against convention! The implicit convention for file and directory names is that they act as a locator / ID for files, never as content. Again, we can learn a bit from Linux config files: usually there is a master file containing an include statement which loads all the split files. The include statement gives a path glob expression which locates the other files; the path to those files is irrelevant to the semantics of their content. Technically, we can do something similar with YAML.

    Assume we want to split this single YAML file into multiple files (pardon my lack of creativity):

    spam:
      spam: spam
      egg: sausage
    baked beans:
    - spam
    - spam
    - bacon
    

    A possible transformation would be this (read stuff ending with / as directory, : starts file content):

    confdir/
      main.yaml:
        spam: !include spammap/main.yaml
        baked beans: !include beans/
      spammap/
        main.yaml:
          spam: !include spam.yaml
          egg: !include egg.yaml
        spam.yaml:
          spam
        egg.yaml:
          sausage
      beans/
        1.yaml:
          spam
        2.yaml:
          spam
        3.yaml:
          bacon
    

    (In YAML, !include is a local tag. With most implementations, you can register a custom constructor for it, thus loading the whole hierarchy as a single document.)

    As you can see, I put every hierarchy level and every value into a separate file. I use two kinds of includes: a reference to a file loads the content of that file; a reference to a directory generates a sequence where each item's value is the content of one file in that directory, sorted by file name. The file and directory names are never part of the content; sometimes I opted to name them differently (e.g. baked beans -> beans/) to avoid possible file system problems (spaces in file names in this case, usually not a serious problem nowadays). Also, I adhere to the file name extension convention (having the files carry .yaml). This would be more quirky if you put content into the file names.

    I named the starting file on each level main.yaml (not needed in beans/ since it is a sequence). While the exact name is arbitrary, this is a convention used by several other tools, e.g. Python with __init__.py or the Nix package manager with default.nix. Then I placed additional files or directories beside this main file.

    Since including other files is explicit, it is not a problem with this approach to put a larger part of the content into a single file. Note that JSON lacks YAML's tag functionality, but you can still walk through a loaded JSON file and preprocess values like {"!include": "path"}.


    To sum up: while there is no ready-made convention for exactly what you want, parts of the problem have been solved in different places, and you can inherit wisdom from that.


    Here's a minimal working example of how to do it with PyYAML. This is just a proof of concept; several features are missing (e.g. autogenerated file names are just ascending numbers, and there is no support for serializing lists into directories). It shows what needs to be done to store information about the data layout while remaining transparent to the user (data can be accessed like a normal dict structure). It remembers which file each value was loaded from and stores it to that file again.

    import os.path
    from pathlib import Path
    
    import yaml
    from yaml.reader import Reader
    from yaml.scanner import Scanner
    from yaml.parser import Parser
    from yaml.composer import Composer
    from yaml.constructor import SafeConstructor
    from yaml.resolver import Resolver
    from yaml.emitter import Emitter
    from yaml.serializer import Serializer
    from yaml.representer import SafeRepresenter
    
    class SplitValue(object):
      """This is a value that should be written into its own YAML file."""
    
      def __init__(self, content, path = None):
        self._content = content
        self._path = path
    
      def getval(self):
        return self._content
    
      def setval(self, value):
        self._content = value
    
      def __repr__(self):
        return self._content.__repr__()
    
    class TransparentContainer(object):
      """Makes SplitValues transparent to the user."""
    
      def __getitem__(self, key):
        val = super(TransparentContainer, self).__getitem__(key)
        return val.getval() if isinstance(val, SplitValue) else val
    
      def __setitem__(self, key, value):
        val = super(TransparentContainer, self).__getitem__(key)
        if isinstance(val, SplitValue) and not isinstance(value, SplitValue):
          val.setval(value)
        else:
          super(TransparentContainer, self).__setitem__(key, value)
    
    class TransparentList(TransparentContainer, list):
      pass
    
    class TransparentDict(TransparentContainer, dict):
      pass
    
    
    class DirectoryAwareFileProcessor(object):
      def __init__(self, path, mode):
        self._basedir = os.path.dirname(path)
        self._file = open(path, mode)
    
      def close(self):
        try:
          self._file.close()
        finally:
          self.dispose() # implemented by PyYAML
    
      # __enter__ / __exit__ to use this in a `with` construct
      def __enter__(self):
        return self
    
      def __exit__(self, type, value, traceback):
        self.close()
    
    class FilesystemLoader(DirectoryAwareFileProcessor, Reader, Scanner,
        Parser, Composer, SafeConstructor, Resolver):
      """Loads YAML file from a directory structure."""
      def __init__(self, path):
        DirectoryAwareFileProcessor.__init__(self, path, 'r')
        Reader.__init__(self, self._file)
        Scanner.__init__(self)
        Parser.__init__(self)
        Composer.__init__(self)
        SafeConstructor.__init__(self)
        Resolver.__init__(self)
    
    def split_value_constructor(loader, node):
      path = loader.construct_scalar(node)
      with FilesystemLoader(os.path.join(loader._basedir, path)) as childLoader:
        return SplitValue(childLoader.get_single_data(), path)
    
    FilesystemLoader.add_constructor(u'!include', split_value_constructor)
    
    def transp_dict_constructor(loader, node):
      ret = TransparentDict()
      ret.update(loader.construct_mapping(node, deep=True))
      return ret
    
    # override constructor for !!map, the default resolved tag for mappings
    FilesystemLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG,
        transp_dict_constructor)
    
    def transp_list_constructor(loader, node):
      ret = TransparentList()
      ret.extend(loader.construct_sequence(node, deep=True))
      return ret
    
    # like above, for !!seq
    FilesystemLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_SEQUENCE_TAG,
        transp_list_constructor)
    
    
    class FilesystemDumper(DirectoryAwareFileProcessor, Emitter,
        Serializer, SafeRepresenter, Resolver):
      def __init__(self, path):
        DirectoryAwareFileProcessor.__init__(self, path, 'w')
        Emitter.__init__(self, self._file)
        Serializer.__init__(self)
        SafeRepresenter.__init__(self)
        Resolver.__init__(self)
    
        self.__next_unique_name = 1
        Serializer.open(self)
    
      def gen_unique_name(self):
        val = self.__next_unique_name
        self.__next_unique_name = self.__next_unique_name + 1
        return str(val)
    
      def close(self):
        try:
          Serializer.close(self)
        finally:
          DirectoryAwareFileProcessor.close(self)
    
    def split_value_representer(dumper, data):
      if data._path is None:
        if isinstance(data._content, TransparentContainer):
          data._path = os.path.join(dumper.gen_unique_name(), "main.yaml")
        else:
          data._path = dumper.gen_unique_name() + ".yaml"
      # create the target directory relative to the dumper's base directory
      Path(os.path.join(dumper._basedir, os.path.dirname(data._path))).mkdir(parents=True, exist_ok=True)
      with FilesystemDumper(os.path.join(dumper._basedir, data._path)) as childDumper:
        childDumper.represent(data._content)
      return dumper.represent_scalar(u'!include', data._path)
    
    yaml.add_representer(SplitValue, split_value_representer, FilesystemDumper)
    
    def transp_dict_representer(dumper, data):
      return dumper.represent_dict(data)
    
    yaml.add_representer(TransparentDict, transp_dict_representer, FilesystemDumper)
    
    def transp_list_representer(dumper, data):
      return dumper.represent_list(data)
    
    yaml.add_representer(TransparentList, transp_list_representer, FilesystemDumper)
    
    # example usage:
    
    # explicitly specify values that should be split.
    myData = TransparentDict({
      "spam": SplitValue({
        "spam": SplitValue("spam", "spam.yaml"),
        "egg": SplitValue("sausage", "sausage.yaml")}, "spammap/main.yaml")})
    
    with FilesystemDumper("root.yaml") as dumper:
      dumper.represent(myData)
    
    # load values from stored files.
    # The loaded data remembers which values have been in which files.
    with FilesystemLoader("root.yaml") as loader:
      loaded = loader.get_single_data()
    
    # modify a value as if it was a normal structure.
    # actually updates a SplitValue
    loaded["spam"]["spam"] = "baked beans"
    # dumps the same structure as before, with the modified value.
    with FilesystemDumper("root.yaml") as dumper:
      dumper.represent(loaded)