Preserving a multi-line string as-is when round-tripping in ruamel.yaml

Suppose I have a file like so

test:
    long: "This is a sample text
      across two lines."

When I load the file and dump it back without changing the data, the document is changed into

test:
    long: "This is a sample text\
      \ across two lines."

While this is correct and doesn't change the actual value, for huge YAML files it creates a lot of spurious diffs, which makes the real changes difficult to spot.

This is the code I have used so far

import sys
import ruamel.yaml
from pathlib import Path

yaml = ruamel.yaml.YAML()  # defaults to round-trip
yaml.allow_duplicate_keys = True
yaml.preserve_quotes = True
yaml.explicit_start = True
file_name = "ca.yml"

with open(file_name) as fp:
    data = yaml.load(fp)

with open(file_name, 'w') as fp:
    yaml.dump(data, fp)

Could someone help me understand whether there are settings I can use to achieve this? Or, in case it's not possible, are there any workarounds to do the same?

Solution

The fix described in this answer was added in ruamel.yaml 0.17.23.

I cannot recreate the output, so something seems to be missing from your code. In my tests the backslashes did not appear, which I expected: I don't recall there being special code for handling newlines in a double-quoted scalar, and AFAICT that was only added for folded block style scalars. But that was not the problem.

There are a few things that are strange to me:

Your output is indented as if .indent(mapping=4) was set on your YAML instance, but your code doesn't reflect that.
Your code sets .explicit_start = True, but your output doesn't reflect that.
Your output wraps (around column 30), but there is no code that sets a width.

Playing around a bit, I could get your output by setting .width to a value in the range 27-32. I also found that if you don't set preserve_quotes, the output gets neither the backslashes nor the quotes:

import sys
import ruamel.yaml

yaml_str = """\
test:
    long: "This is a sample text
      across two lines."
"""

for pq in [True, False]:
    yaml = ruamel.yaml.YAML()  # defaults to round-trip
    yaml.preserve_quotes = pq
    yaml.indent(mapping=4)
    yaml.width = 27
    yaml.allow_duplicate_keys = True
    yaml.explicit_start = True

    data = yaml.load(yaml_str)
    # check there are no hidden spaces or newlines in the loaded data
    assert data["test"]["long"] == 'This is a sample text across two lines.'
    yaml.dump(data, sys.stdout)

which gives:

---
test:
    long: "This is a sample text\
        \ across two lines."
---
test:
    long: This is a sample text
        across two lines.

So this seems to have to do specifically with the code that dumps strings with style '"' (double-quoted).

BTW, I recommend not overwriting the input during this kind of testing. Instead, write the input from the code if you need to do file-to-file loading/dumping, or use string input and sys.stdout output (when doing visual inspection).
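One way to follow that advice is to dump into an in-memory buffer and compare it against the original text, so the input file is never touched. The following is only a sketch: round_trip_diff is a hypothetical helper name, and it assumes nothing more than an object with load(text) and dump(data, stream) methods, which a ruamel.yaml YAML() instance provides.

```python
import difflib
import io


def round_trip_diff(yaml_instance, yaml_text):
    """Load yaml_text, dump it to an in-memory buffer, and return the diff lines.

    yaml_instance is assumed to offer load(text) and dump(data, stream),
    as a ruamel.yaml YAML() instance does. An empty list means the
    round trip reproduced the input exactly.
    """
    data = yaml_instance.load(yaml_text)
    buffer = io.StringIO()
    yaml_instance.dump(data, buffer)
    return list(
        difflib.unified_diff(
            yaml_text.splitlines(),
            buffer.getvalue().splitlines(),
            fromfile="original",
            tofile="round-tripped",
            lineterm="",
        )
    )


# Usage, assuming ruamel.yaml is installed and ca.yml exists:
#   from pathlib import Path
#   import ruamel.yaml
#   yaml = ruamel.yaml.YAML()
#   yaml.preserve_quotes = True
#   for line in round_trip_diff(yaml, Path("ca.yml").read_text()):
#       print(line)
```

Because the result is computed from strings only, the same helper works for visual inspection and for an automated check that a set of files round-trips without changes.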

This garbage is caused by code forked from PyYAML years ago:

import sys
import yaml  # PyYAML

# yaml_str is the same string as in the previous snippet
data = yaml.safe_load(yaml_str)
assert data["test"]["long"] == 'This is a sample text across two lines.'
yaml.safe_dump(data, sys.stdout, indent=4, width=27, default_style='"')

which gives:

"test":
    "long": "This is a sample\
        \ text across two lines."

and that leads to the code for write_double_quoted in emitter.py:

class MyEmitter(ruamel.yaml.emitter.Emitter):
    def write_double_quoted(self, text, split=True):
        if self.root_context:
            if self.requested_indent is not None:
                self.write_line_break()
                if self.requested_indent != 0:
                    self.write_indent()
        self.write_indicator(u'"', True)
        start = end = 0
        while end <= len(text):
            ch = None
            if end < len(text):
                ch = text[end]
            if (
                ch is None
                or ch in u'"\\\x85\u2028\u2029\uFEFF'
                or not (
                    u'\x20' <= ch <= u'\x7E'
                    or (
                        self.allow_unicode
                        and (u'\xA0' <= ch <= u'\uD7FF' or u'\uE000' <= ch <= u'\uFFFD')
                    )
                )
            ):
                if start < end:
                    data = text[start:end]
                    self.column += len(data)
                    if bool(self.encoding):
                        data = data.encode(self.encoding)
                    self.stream.write(data)
                    start = end
                if ch is not None:
                    if ch in self.ESCAPE_REPLACEMENTS:
                        data = u'\\' + self.ESCAPE_REPLACEMENTS[ch]
                    elif ch <= u'\xFF':
                        data = u'\\x%02X' % ord(ch)
                    elif ch <= u'\uFFFF':
                        data = u'\\u%04X' % ord(ch)
                    else:
                        data = u'\\U%08X' % ord(ch)
                    self.column += len(data)
                    if bool(self.encoding):
                        data = data.encode(self.encoding)
                    self.stream.write(data)
                    start = end + 1
            if (
                0 < end < len(text) - 1
                and (ch == u' ' or start >= end)
                and self.column + (end - start) > self.best_width
                and split
            ):
                # data = text[start:end] + u'\\'  # <<< replaced with following two lines
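                # a backslash is only needed when the break is followed by a second space;
                # end + 1 is a valid index because end < len(text) - 1 was checked above
                need_backquote = text[end] == u' ' and text[end + 1] == u' '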
                data = text[start:end] + (u'\\' if need_backquote else u'')
                if start < end:
                    start = end
                self.column += len(data)
                if bool(self.encoding):
                    data = data.encode(self.encoding)
                self.stream.write(data)
                self.write_indent()
                self.whitespace = False
                self.indention = False
                if text[start] == u' ':
                    if not need_backquote:
                        # skip the leading space: loading folds the newline back into a space
                        start += 1
                    # data = u'\\'    # <<< replaced with following line
                    data = u'\\' if need_backquote else u''
                    self.column += len(data)
                    if bool(self.encoding):
                        data = data.encode(self.encoding)
                    self.stream.write(data)
            end += 1
        self.write_indicator(u'"', False)

yaml = ruamel.yaml.YAML()
yaml.Emitter = MyEmitter
yaml.preserve_quotes = True
yaml.indent(mapping=4)
yaml.width = 27

data = yaml.load(yaml_str)
assert data["test"]["long"] == 'This is a sample text across two lines.'
yaml.dump(data, sys.stdout)

which gives:

test:
    long: "This is a sample text
        across two lines."

Which looks like what you want.

The code block around the two changed lines generates a correctly loadable string, as you noted. It just deals very conservatively with potential multiple spaces around the point where a newline is inserted, which is correct for PyYAML, which has no pretensions to preserve the original YAML document, but incorrect for ruamel.yaml. Without those backslashes, extra spaces at the break would disappear during loading:

yaml_str = 'test:\n    long:\n      "This is a sample text  across two lines."'

yaml = ruamel.yaml.YAML()
yaml.Emitter = MyEmitter
yaml.preserve_quotes = True
yaml.indent(mapping=4)
yaml.width = 27

data = yaml.load(yaml_str)
assert data["test"]["long"] == 'This is a sample text  across two lines.'
yaml.dump(data, sys.stdout)

gives:

test:
    long: "This is a sample text\
        \  across two lines."

because of the double spaces.
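To see why the backslashes are still needed in that case, here is a minimal sketch of how a loader folds the physical lines of a double-quoted scalar. It is a simplified model and not ruamel.yaml's scanner: the name fold_double_quoted is made up for illustration, and it only handles plain line breaks, a trailing backslash (escaped line break) and a backslash followed by a space (escaped space). Empty lines and all other escapes are ignored.

```python
def fold_double_quoted(lines):
    """Join the physical lines of a double-quoted scalar (quotes already removed).

    Simplified model of YAML line folding, for illustration only.
    """
    result = ""
    escaped_break = False  # did the previous line end in a backslash?
    for index, line in enumerate(lines):
        if index > 0:
            # leading white space on a continuation line is dropped
            line = line.lstrip(" \t")
            if not escaped_break:
                # a plain line break folds into exactly one space
                result += " "
        escaped_break = line.endswith("\\")
        if escaped_break:
            # escaped line break: contributes nothing to the value
            line = line[:-1]
        elif index < len(lines) - 1:
            # white space before a plain line break is dropped
            line = line.rstrip(" \t")
        # a backslash followed by a space is a preserved space
        result += line.replace("\\ ", " ")
    return result


single = "This is a sample text across two lines."
double = "This is a sample text  across two lines."

# a plain break is enough when the words are separated by one space
assert fold_double_quoted(["This is a sample text", "    across two lines."]) == single

# with two spaces, a plain break silently loses one of them ...
assert fold_double_quoted(["This is a sample text ", "    across two lines."]) == single

# ... so the break and the first space have to be escaped
assert fold_double_quoted(["This is a sample text\\", "    \\  across two lines."]) == double
```

Under this model a break that replaces a single space can be written without backslashes, which is the case the changed emitter code targets, while any additional white space next to the break only survives when it is escaped.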

It doesn't look like the above has other side-effects, but this has not been further tested.

You should take care when using allow_duplicate_keys: it will change your output if you have duplicate keys, and possibly not with the same semantics as another program loading the original document.

You should also consider using the .yaml extension on files containing YAML documents, assuming the other programs using this document can handle that. That has been the recommended extension since at least September 2006, so I hope some others updated their code since then.