python, yaml, rasa

Converting Python dictionary to YAML file with lists as multiline strings


I'm trying to convert a Python dictionary of the following form:

{
    "version": "3.1",
    "nlu": [
        {
            "intent": "greet",
            "examples": ["hi", "hello", "howdy"]
        },
        {
            "intent": "goodbye",
            "examples": ["goodbye", "bye", "see you later"]
        }
    ]
}

to a YAML file of the following form (note the pipe following each examples key, before its value):

version: "3.1"
nlu:
- intent: greet
  examples: |
    - hi
    - hello
    - howdy
- intent: goodbye
  examples: |
    - goodbye
    - bye
    - see you later

Except for needing the pipes (because of Rasa's training data format specs), I'm familiar with how to accomplish this task using yaml.dump().

What's the most straightforward way to obtain the format I'm after?

EDIT: Converting the value of each examples key to a string first yields a YAML file which is not at all reader-friendly, especially given that I have many intents comprising many hundreds of total example utterances.

version: '3.1'
nlu:
- intent: greet
  examples: "  - hi\n  - hello\n  - howdy\n" 
- intent: goodbye
  examples: "  - goodbye\n  - bye\n  - see you later\n"  

I understand that this multi-line format is what the pipe symbol accomplishes, but I'd like to convert it to something more palatable. Is that possible?


Solution

  • Neither my ruamel.yaml nor PyYAML gives easy access to context when dumping a scalar. Without such context you can only render strings differently based on their content; you cannot determine whether a list/sequence is the value for a particular key and dump it in a different way than some other value.

    As @larsks already indicated, you need to transform the Python list values into strings. I suggest, however, doing that before dumping, with a recursive function, so that you do have the necessary context. In this case it is possible to do that in place, which is usually the easier option to implement. If that is unacceptable (i.e. you need to keep using the unmodified data structure after dumping), you can either first make a copy.deepcopy() of your data, or modify transform_value to create that copy and return it (recursively).

    ruamel.yaml can round-trip your requested output (specifically preserving the literal scalar as is). If you load that output and inspect the type of the value for the key examples, you will see that it is not a plain string, but a ruamel.yaml.scalarstring.LiteralScalarString instance. That instance behaves like a string in Python, but dumps as a literal scalar.

    import sys, io
    import ruamel.yaml
    
    data = {
        "version": "3.1",
        "nlu": [
            {
                "intent": "greet",
                "examples": ["hi", "hello", "howdy"]
            },
            {
                "intent": "goodbye",
                "examples": ["goodbye", "bye", "see you later"]
            }
        ]
    }
    
    yaml = ruamel.yaml.YAML()
    
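    def literalize_list(v):
        """dump a list to YAML text and wrap it so that it is emitted as a literal scalar (|)"""
        assert isinstance(v, list)
        buf = io.StringIO()
        yaml.dump(v, buf)  # writes the block-style sequence ("- hi", "- hello", ...) into buf
        return ruamel.yaml.scalarstring.LiteralScalarString(buf.getvalue())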
    
    def transform_value(d, key, transformation):
        """recursively walk over data structure to find key and apply transformation on the value"""
        if isinstance(d, dict):
            for k, v in d.items():
                if k == key:
                    d[k] = transformation(v)
                else:
                    transform_value(v, key, transformation)
        elif isinstance(d, list):
            for elem in d:
                transform_value(elem, key, transformation)
        
    
    transform_value(data, 'examples', literalize_list)
    
    yaml.dump(data, sys.stdout)
    

    which gives:

    version: '3.1'
    nlu:
    - intent: greet
      examples: |
        - hi
        - hello
        - howdy
    - intent: goodbye
      examples: |
        - goodbye
        - bye
        - see you later
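
If the original data has to stay unmodified, the copying variant of transform_value mentioned earlier could look roughly like the sketch below. The names `transformed_copy` and `bullet_block` are hypothetical; `bullet_block` is only a standard-library stand-in for the transformation, so that the example runs without ruamel.yaml, and a function such as literalize_list from the answer can be passed instead.

```python
def bullet_block(items):
    """Stand-in transformation (an assumption for this sketch): render a list as '- item' lines."""
    return "".join(f"- {item}\n" for item in items)


def transformed_copy(node, key, transformation):
    """Return a new structure in which every value stored under `key` is
    replaced by transformation(value); the input is left untouched."""
    if isinstance(node, dict):
        return {
            k: transformation(v) if k == key else transformed_copy(v, key, transformation)
            for k, v in node.items()
        }
    if isinstance(node, list):
        return [transformed_copy(elem, key, transformation) for elem in node]
    return node  # scalars are returned as they are


data = {
    "version": "3.1",
    "nlu": [
        {"intent": "greet", "examples": ["hi", "hello", "howdy"]},
        {"intent": "goodbye", "examples": ["goodbye", "bye", "see you later"]},
    ],
}

result = transformed_copy(data, "examples", bullet_block)
print(result["nlu"][0]["examples"])
print(data["nlu"][0]["examples"])  # still the original list
```

The transformation receives the original value rather than a copy, so it should not mutate its argument.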
    

    The string value 3.1 needs to be quoted so that it is not loaded as a float. By default it is dumped as a single-quoted scalar (single-quoted scalars are easier/quicker to parse in YAML than double-quoted ones). If you want it dumped with double quotes, you can do:

    data['version'] = ruamel.yaml.scalarstring.DoubleQuotedScalarString(data['version'])