Suppose I have a file like so
test:
    long: "This is a sample text
        across two lines."
When I load the file and dump it back with no changes to the file, it changes this document into
test:
    long: "This is a sample text\
        \ across two lines."
While this is correct and doesn't change the actual value, for huge YAML files it creates a lot of diffs and makes it difficult to spot the meaningful changes.
This is the code I have used so far
import sys
import ruamel.yaml
from pathlib import Path
yaml = ruamel.yaml.YAML()  # defaults to round-trip
yaml.allow_duplicate_keys = True
yaml.preserve_quotes = True
yaml.explicit_start = True
file_name = "ca.yml"
with open(file_name) as fp:
    data = yaml.load(fp)
with open(file_name, 'w') as fp:
    yaml.dump(data, fp)
Could someone help me understand whether there are settings I can use to keep the original line breaks when dumping? If that is not possible, are there any workarounds?
The code from this answer was added to ruamel.yaml 0.17.23.
I cannot recreate the output so something seems to be missing. In my tests the backslashes went missing, which I expected as I don't recall there is special code for handling newlines in a double quoted scalar, and AFAICT that was only added for folded block style scalars, but that was not the problem.
There are a few things that are strange to me:
- your output looks as if .indent(mapping=4) was set on your YAML instance, but your code doesn't reflect that.
- your code sets .explicit_start = True, but your output doesn't reflect that.
Playing around a bit, I could get your output when I set .width to a value of 27-32, and I found that if you don't set preserve_quotes the output doesn't get the backslashes (but also not the quotes):
import sys
import ruamel.yaml
yaml_str = """\
test:
    long: "This is a sample text
        across two lines."
"""
for pq in [True, False]:
    yaml = ruamel.yaml.YAML()  # defaults to round-trip
    yaml.preserve_quotes = pq
    yaml.indent(mapping=4)
    yaml.width = 27
    yaml.allow_duplicate_keys = True
    yaml.explicit_start = True
    data = yaml.load(yaml_str)
    # check there are no hidden spaces or newlines in the loaded data
    assert data["test"]["long"] == 'This is a sample text across two lines.'
    yaml.dump(data, sys.stdout)
which gives:
---
test:
    long: "This is a sample text\
        \ across two lines."
---
test:
    long: This is a sample text
        across two lines.
So this seems to have to do specifically with the code that dumps strings with style '"'.
BTW, I can recommend not overwriting the input during this kind of testing, instead write the input from the code if you need to do file-to-file loading/dumping, or use string input and sys.stdout output (when doing visual inspection).
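If it helps, that kind of inspection can be done on strings with a small helper along the following lines. This is only a sketch: round_trip_diff and ruamel_round_trip are names made up for illustration, not part of ruamel.yaml, and the second one assumes ruamel.yaml is installed.

```python
import difflib
import io


def round_trip_diff(source, round_trip):
    """Return a unified diff between `source` and `round_trip(source)`.

    `round_trip` is any callable that takes the YAML text and returns the
    re-dumped text, so the input file is never overwritten.
    """
    dumped = round_trip(source)
    diff = difflib.unified_diff(
        source.splitlines(keepends=True),
        dumped.splitlines(keepends=True),
        fromfile="input",
        tofile="round-tripped",
    )
    return "".join(diff)


def ruamel_round_trip(source):
    """Hypothetical helper: load and dump `source` with ruamel.yaml."""
    import ruamel.yaml  # imported here so the diff helper works without it

    yaml = ruamel.yaml.YAML()
    yaml.preserve_quotes = True
    buf = io.StringIO()
    yaml.dump(yaml.load(source), buf)
    return buf.getvalue()
```

round_trip_diff(yaml_str, ruamel_round_trip) returns an empty string when the dump reproduces the input, and otherwise only the hunks that changed.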
This garbage is caused by code forked from PyYAML years ago:
import sys
import yaml # PyYAML
data = yaml.safe_load(yaml_str)
assert data["test"]["long"] == 'This is a sample text across two lines.'
yaml.safe_dump(data, sys.stdout, indent=4, width=27, default_style='"')
which gives:
"test":
    "long": "This is a sample\
        \ text across two lines."
and that leads to the code for write_double_quoted in emitter.py:
class MyEmitter(ruamel.yaml.emitter.Emitter):
    def write_double_quoted(self, text, split=True):
        if self.root_context:
            if self.requested_indent is not None:
                self.write_line_break()
                if self.requested_indent != 0:
                    self.write_indent()
        self.write_indicator(u'"', True)
        start = end = 0
        while end <= len(text):
            ch = None
            if end < len(text):
                ch = text[end]
            if (
                ch is None
                or ch in u'"\\\x85\u2028\u2029\uFEFF'
                or not (
                    u'\x20' <= ch <= u'\x7E'
                    or (
                        self.allow_unicode
                        and (u'\xA0' <= ch <= u'\uD7FF' or u'\uE000' <= ch <= u'\uFFFD')
                    )
                )
            ):
                if start < end:
                    data = text[start:end]
                    self.column += len(data)
                    if bool(self.encoding):
                        data = data.encode(self.encoding)
                    self.stream.write(data)
                    start = end
                if ch is not None:
                    if ch in self.ESCAPE_REPLACEMENTS:
                        data = u'\\' + self.ESCAPE_REPLACEMENTS[ch]
                    elif ch <= u'\xFF':
                        data = u'\\x%02X' % ord(ch)
                    elif ch <= u'\uFFFF':
                        data = u'\\u%04X' % ord(ch)
                    else:
                        data = u'\\U%08X' % ord(ch)
                    self.column += len(data)
                    if bool(self.encoding):
                        data = data.encode(self.encoding)
                    self.stream.write(data)
                    start = end + 1
            if (
                0 < end < len(text) - 1
                and (ch == u' ' or start >= end)
                and self.column + (end - start) > self.best_width
                and split
            ):
                # data = text[start:end] + u'\\'  # <<< replaced with following two lines
                need_backquote = text[end] == u' ' and (len(text) > end) and text[end+1] == u' '
                data = text[start:end] + (u'\\' if need_backquote else u'')
                if start < end:
                    start = end
                self.column += len(data)
                if bool(self.encoding):
                    data = data.encode(self.encoding)
                self.stream.write(data)
                self.write_indent()
                self.whitespace = False
                self.indention = False
                if text[start] == u' ':
                    if not need_backquote:
                        # remove leading space it will load from the newline
                        start += 1
                    # data = u'\\'  # <<< replaced with following line
                    data = u'\\' if need_backquote else u''
                    self.column += len(data)
                    if bool(self.encoding):
                        data = data.encode(self.encoding)
                    self.stream.write(data)
            end += 1
        self.write_indicator(u'"', False)
yaml = ruamel.yaml.YAML()
yaml.Emitter = MyEmitter
yaml.preserve_quotes = True
yaml.indent(mapping=4)
yaml.width = 27
data = yaml.load(yaml_str)
assert data["test"]["long"] == 'This is a sample text across two lines.'
yaml.dump(data, sys.stdout)
which gives:
test:
    long: "This is a sample text
        across two lines."
Which looks like what you want.
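The decision made by the two changed lines can be looked at in isolation. The following is a minimal sketch, not the emitter itself: needs_backslash and split_double_quoted are hypothetical names, the sketch assumes the split position is a space, and it checks the bounds before indexing (a small deviation from the order used in the code above).

```python
def needs_backslash(text, end):
    """True when the space at `end` is directly followed by another space."""
    return text[end] == " " and end + 1 < len(text) and text[end + 1] == " "


def split_double_quoted(text, end):
    """Split `text` at the space at index `end` the way the patched code does.

    Returns the two physical lines without quotes and indentation.
    """
    if text[end] != " ":
        raise ValueError("this sketch only handles a split at a space")
    if needs_backslash(text, end):
        # Escape the line break and the leading spaces so they survive loading.
        return text[:end] + "\\", "\\" + text[end:]
    # A single space is dropped; the loader recreates it from the newline.
    return text[:end], text[end + 1:]


single = "This is a sample text across two lines."
double = "This is a sample text  across two lines."
pos = single.index(" across")
print(split_double_quoted(single, pos))
print(split_double_quoted(double, pos))
```

With a single space at the split point no backslashes are written; with two spaces both the trailing and the leading backslash are kept.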
The code block around the two changed lines generates a correctly loadable string, as you noted. It just deals
very conservatively with potential multiple spaces around the point where a newline is inserted, which
is correct for PyYAML, which has no pretensions to preserve the original YAML document, but incorrect
for ruamel.yaml. Without those backslashes, extra
spaces would otherwise disappear during loading.
yaml_str = 'test:\n    long:\n        "This is a sample text  across two lines."'
yaml = ruamel.yaml.YAML()
yaml.Emitter = MyEmitter
yaml.preserve_quotes = True
yaml.indent(mapping=4)
yaml.width = 27
data = yaml.load(yaml_str)
assert data["test"]["long"] == 'This is a sample text  across two lines.'
yaml.dump(data, sys.stdout)
gives:
test:
    long: "This is a sample text\
        \  across two lines."
because of the double spaces.
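To see why the backslashes are needed in that case, here is a simplified model of how a loader joins the physical lines of a double-quoted scalar. It is a sketch under assumptions, not the ruamel.yaml scanner: fold_double_quoted is a hypothetical name, and the model ignores empty lines and every escape other than an escaped space and an escaped line break.

```python
def fold_double_quoted(lines):
    """Join the physical lines found between the double quotes.

    Simplified rules: whitespace around a plain line break collapses into a
    single space, a trailing backslash removes the line break entirely, and
    a backslash followed by a space stands for a literal space.
    """
    result = ""
    last = len(lines) - 1
    for i, line in enumerate(lines):
        if i > 0:
            line = line.lstrip(" ")  # indentation and leading spaces are dropped
        escaped_break = line.endswith("\\")
        if escaped_break:
            line = line[:-1]  # drop the backslash, keep what precedes it
        elif i < last:
            line = line.rstrip(" ")  # trailing spaces before a plain break are dropped
        result += line.replace("\\ ", " ")
        if i < last and not escaped_break:
            result += " "  # the folded line break
    return result


# Double space split without backslashes: one space is lost on loading.
print(repr(fold_double_quoted(["This is a sample text ", "     across two lines."])))
# Same split with backslashes: both spaces survive.
print(repr(fold_double_quoted(["This is a sample text\\", "    \\  across two lines."])))
```

The first call yields a single space between "text" and "across", the second keeps both spaces, which is why the emitter cannot drop the backslashes there.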
It doesn't look like the above has other side-effects, but this has not been further tested.
You should take care when using allow_duplicate_keys: it will change your output if you have duplicate keys,
and possibly not with the same semantics as another program loading the original document.
You should also consider using the .yaml extension on files containing YAML documents, assuming
the other programs using this document can handle that. That
has been the recommended extension since at least September 2006, so I hope some others updated their code
since then.