I'm trying to convert a Python dictionary of the following form:
{
    "version": "3.1",
    "nlu": [
        {
            "intent": "greet",
            "examples": ["hi", "hello", "howdy"]
        },
        {
            "intent": "goodbye",
            "examples": ["goodbye", "bye", "see you later"]
        }
    ]
}
to a YAML file of the following form (note the pipes preceding the value associated with each examples key):
version: "3.1"
nlu:
- intent: greet
  examples: |
    - hi
    - hello
    - howdy
- intent: goodbye
  examples: |
    - goodbye
    - bye
    - see you later
Except for needing the pipes (because of Rasa's training data format specs), I'm familiar with how to accomplish this task using yaml.dump().
What's the most straightforward way to obtain the format I'm after?
EDIT: Converting the value of each examples key to a string first yields a YAML file which is not at all reader-friendly, especially given that I have many intents comprising many hundreds of total example utterances.
version: '3.1'
nlu:
- intent: greet
  examples: " - hi\n - hello\n - howdy\n"
- intent: goodbye
  examples: " - goodbye\n - bye\n - see you later\n"
I understand that this multi-line format is what the pipe symbol accomplishes, but I'd like to convert it to something more palatable. Is that possible?
Neither my ruamel.yaml nor PyYAML gives easy access to context when dumping a scalar. Without
such context you can only render strings differently based on their content, and you cannot
determine whether a list/sequence is the value for a particular key and dump it differently than some other value.
As @larsks already indicated, you need to transform the Python list values into strings. I suggest, however,
doing that before dumping, with a recursive function, so that you do have the necessary context. In
this case it is possible to do that in place, which is usually the easier option to implement.
If that is unacceptable (i.e. you need to keep using the unmodified data structure after dumping), you
can either first make a copy.deepcopy() of your data, or modify transform_value to create
that copy and return it (recursively).
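The second option could look roughly like the sketch below. The name transformed_copy is hypothetical (it is not part of ruamel.yaml), and the demo uses a plain-string stand-in for the transformation so that it runs without any YAML library; in this answer's setting you would pass literalize_list instead.

```python
def transformed_copy(d, key, transformation):
    """Return a copy of d in which every value stored under `key`
    has been replaced by transformation(value). The input is not modified."""
    if isinstance(d, dict):
        return {
            k: transformation(v) if k == key
            else transformed_copy(v, key, transformation)
            for k, v in d.items()
        }
    if isinstance(d, list):
        return [transformed_copy(elem, key, transformation) for elem in d]
    # Leaves are returned as-is, which is fine for strings and numbers.
    return d


data = {"version": "3.1", "nlu": [{"intent": "greet", "examples": ["hi", "hello"]}]}


def as_block(items):
    # Stand-in transformation for the demo only: renders a list as "- item" lines.
    return "".join(f"- {item}\n" for item in items)


converted = transformed_copy(data, "examples", as_block)
print(converted["nlu"][0]["examples"])
print(data["nlu"][0]["examples"])  # the original list is still a list
```

Compared with copy.deepcopy() followed by the in-place version, this walks the structure only once, at the cost of a slightly less obvious function.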
ruamel.yaml can round-trip your requested output (specifically preserving the literal scalar as is).
If you inspect the type of the value for the key examples after loading, you will see that it is not a plain string
but a ruamel.yaml.scalarstring.LiteralScalarString instance. That instance behaves like a string
in Python, but dumps as a literal scalar.
# Convert each 'examples' list into a literal block scalar, then dump.
import sys, io
import ruamel.yaml

data = {
    "version": "3.1",
    "nlu": [
        {
            "intent": "greet",
            "examples": ["hi", "hello", "howdy"]
        },
        {
            "intent": "goodbye",
            "examples": ["goodbye", "bye", "see you later"]
        }
    ]
}

yaml = ruamel.yaml.YAML()

def literalize_list(v):
    """dump the list to a YAML string and wrap it so that it is emitted as a literal scalar"""
    assert isinstance(v, list)
    buf = io.StringIO()
    yaml.dump(v, buf)
    return ruamel.yaml.scalarstring.LiteralScalarString(buf.getvalue())

def transform_value(d, key, transformation):
    """recursively walk over data structure to find key and apply transformation on the value"""
    if isinstance(d, dict):
        for k, v in d.items():
            if k == key:
                # replacing the value of an existing key is safe while iterating
                d[k] = transformation(v)
            else:
                transform_value(v, key, transformation)
    elif isinstance(d, list):
        for elem in d:
            transform_value(elem, key, transformation)

# modifies data in place
transform_value(data, 'examples', literalize_list)
yaml.dump(data, sys.stdout)
which gives:
version: '3.1'
nlu:
- intent: greet
  examples: |
    - hi
    - hello
    - howdy
- intent: goodbye
  examples: |
    - goodbye
    - bye
    - see you later
The string value 3.1 needs to be quoted so that it is not loaded as a float. By default it is dumped
as a single-quoted scalar (which is easier/quicker to parse in YAML than a double-quoted scalar).
If you want it dumped with double quotes you can do:
data['version'] = ruamel.yaml.scalarstring.DoubleQuotedScalarString(data['version'])