Preserving a multi-line string as is when round-tripping in ruamel.yaml


Suppose I have a file like so

test:
    long: "This is a sample text
      across two lines."

When I load the file and dump it back with no changes to the file, it changes this document into

test:
    long: "This is a sample text\
      \ across two lines."

While this is correct and doesn't change the actual value, in huge YAML files it creates a lot of spurious diffs, which makes the real changes hard to spot.

This is the code I have used so far

import sys
import ruamel.yaml
from pathlib import Path

yaml = ruamel.yaml.YAML()  # defaults to round-trip
yaml.allow_duplicate_keys = True
yaml.preserve_quotes = True
yaml.explicit_start = True
file_name = "ca.yml"

with open(file_name) as fp:
    data = yaml.load(fp)

with open(file_name, 'w') as fp:
    yaml.dump(data, fp)

Could someone help me understand whether there are settings I can use to achieve this or, if that is not possible, any workarounds that do the same?


Solution

  • The fix shown below was added to ruamel.yaml in version 0.17.23.


    I cannot recreate the output, so something seems to be missing. In my tests the backslashes went missing, which I expected: I don't recall there being special code for handling newlines in a double-quoted scalar (AFAICT that was only added for folded block style scalars). But that was not the problem.

    A few things are strange to me. Playing around a bit, I could get your output when I set .width to a value of 27-32, and if you don't set preserve_quotes the output doesn't get the backslashes (but also not the quotes):

    import sys
    import ruamel.yaml
    
    yaml_str = """\
    test:
        long: "This is a sample text
          across two lines."
    """
    
    for pq in [True, False]:
        yaml = ruamel.yaml.YAML()  # defaults to round-trip
        yaml.preserve_quotes = pq
        yaml.indent(mapping=4)
        yaml.width = 27
        yaml.allow_duplicate_keys = True
        yaml.explicit_start = True
    
        data = yaml.load(yaml_str)
        # check there are no hidden spaces or newlines in the loaded data
        assert data["test"]["long"] == 'This is a sample text across two lines.'
        yaml.dump(data, sys.stdout)
    

    which gives:

    ---
    test:
        long: "This is a sample text\
            \ across two lines."
    ---
    test:
        long: This is a sample text
            across two lines.
    

    So this seems to have to do specifically with the code that dumps strings with style '"'

    BTW, I recommend not overwriting the input during this kind of testing. Instead, write the input from the code if you need to do file-to-file loading/dumping, or use string input and sys.stdout output (when doing visual inspection).
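    As a rough illustration of testing from string input without touching the file, the sketch below diffs a source string against its round-tripped form using only the standard library. The helper name round_trip_diff and its round_trip argument are hypothetical and not part of ruamel.yaml; an identity function stands in for the real load/dump cycle.

```python
import difflib


def round_trip_diff(source, round_trip):
    """Return unified-diff lines between `source` and `round_trip(source)`.

    `round_trip` is any callable that takes and returns a string. Both names
    are hypothetical helpers for this sketch, not part of ruamel.yaml.
    """
    result = round_trip(source)
    return list(
        difflib.unified_diff(
            source.splitlines(keepends=True),
            result.splitlines(keepends=True),
            fromfile="input",
            tofile="output",
        )
    )


yaml_str = """\
test:
    long: "This is a sample text
      across two lines."
"""

# An identity function stands in for a real load/dump cycle here,
# so the diff is empty.
print(round_trip_diff(yaml_str, lambda s: s))
```

    With ruamel.yaml you would presumably pass a function that loads the string and dumps into an io.StringIO buffer. An empty list then means the round trip left the document untouched.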

    This garbage is caused by code forked from PyYAML years ago:

    import sys
    import yaml  # PyYAML
    
    data = yaml.safe_load(yaml_str)
    assert data["test"]["long"] == 'This is a sample text across two lines.'
    yaml.safe_dump(data, sys.stdout, indent=4, width=27, default_style='"')
    

    which gives:

    "test":
        "long": "This is a sample\
            \ text across two lines."
    

    and that leads to the code for write_double_quoted in emitter.py:

    class MyEmitter(ruamel.yaml.emitter.Emitter):
        def write_double_quoted(self, text, split=True):
            if self.root_context:
                if self.requested_indent is not None:
                    self.write_line_break()
                    if self.requested_indent != 0:
                        self.write_indent()
            self.write_indicator(u'"', True)
            start = end = 0
            while end <= len(text):
                ch = None
                if end < len(text):
                    ch = text[end]
                if (
                    ch is None
                    or ch in u'"\\\x85\u2028\u2029\uFEFF'
                    or not (
                        u'\x20' <= ch <= u'\x7E'
                        or (
                            self.allow_unicode
                            and (u'\xA0' <= ch <= u'\uD7FF' or u'\uE000' <= ch <= u'\uFFFD')
                        )
                    )
                ):
                    if start < end:
                        data = text[start:end]
                        self.column += len(data)
                        if bool(self.encoding):
                            data = data.encode(self.encoding)
                        self.stream.write(data)
                        start = end
                    if ch is not None:
                        if ch in self.ESCAPE_REPLACEMENTS:
                            data = u'\\' + self.ESCAPE_REPLACEMENTS[ch]
                        elif ch <= u'\xFF':
                            data = u'\\x%02X' % ord(ch)
                        elif ch <= u'\uFFFF':
                            data = u'\\u%04X' % ord(ch)
                        else:
                            data = u'\\U%08X' % ord(ch)
                        self.column += len(data)
                        if bool(self.encoding):
                            data = data.encode(self.encoding)
                        self.stream.write(data)
                        start = end + 1
                if (
                    0 < end < len(text) - 1
                    and (ch == u' ' or start >= end)
                    and self.column + (end - start) > self.best_width
                    and split
                ):
                    # data = text[start:end] + u'\\'  # <<< replaced with following two lines
                    need_backquote = text[end] == u' ' and (len(text) > end) and text[end+1] == u' '
                    data = text[start:end] + (u'\\' if need_backquote else u'')
                    if start < end:
                        start = end
                    self.column += len(data)
                    if bool(self.encoding):
                        data = data.encode(self.encoding)
                    self.stream.write(data)
                    self.write_indent()
                    self.whitespace = False
                    self.indention = False
                    if text[start] == u' ':
                        if not need_backquote:
                            # remove leading space it will load from the newline
                            start += 1 
                        # data = u'\\'    # <<< replaced with following line
                        data = u'\\' if need_backquote else u''
                        self.column += len(data)
                        if bool(self.encoding):
                            data = data.encode(self.encoding)
                        self.stream.write(data)
                end += 1
            self.write_indicator(u'"', False)
    
    yaml = ruamel.yaml.YAML()
    yaml.Emitter = MyEmitter
    yaml.preserve_quotes = True
    yaml.indent(mapping=4)
    yaml.width = 27
    
    data = yaml.load(yaml_str)
    assert data["test"]["long"] == 'This is a sample text across two lines.'
    yaml.dump(data, sys.stdout)
    
    
    

    which gives:

    test:
        long: "This is a sample text
            across two lines."
    

    Which looks like what you want.
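    The two replaced lines boil down to one decision per candidate split point. The sketch below isolates that condition in a standalone function so it can be tried without the emitter. The function name split_needs_backslash is made up for illustration, and the sketch ignores everything else the method does (width tracking, escapes, encoding).

```python
def split_needs_backslash(text, end):
    """Return True if a line split at index `end` needs backslash protection.

    Mirrors the `need_backquote` condition in the patched method: the
    character at the split point is a space and the next one is a space too.
    """
    return (
        end + 1 < len(text)
        and text[end] == " "
        and text[end + 1] == " "
    )


single = "This is a sample text across two lines."
double = "This is a sample text  across two lines."
split_at = single.index(" across")  # index of the space before "across"

print(split_needs_backslash(single, split_at))  # False
print(split_needs_backslash(double, split_at))  # True
```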

    The code block around the two changed lines generates a correctly loadable string, as you noted. It just deals very conservatively with potential multiple spaces around the point where a newline is inserted, which is correct for PyYAML, which has no pretensions to preserve the original YAML document, but incorrect for ruamel.yaml. Without those backslashes, extra spaces would disappear during loading.

    yaml_str = 'test:\n    long:\n      "This is a sample text  across two lines."'
    
    yaml = ruamel.yaml.YAML()
    yaml.Emitter = MyEmitter
    yaml.preserve_quotes = True
    yaml.indent(mapping=4)
    yaml.width = 27
    
    data = yaml.load(yaml_str)
    assert data["test"]["long"] == 'This is a sample text  across two lines.'
    yaml.dump(data, sys.stdout)
    

    gives:

    test:
        long: "This is a sample text\
            \  across two lines."
    

    because of the double spaces.
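    To see why the backslashes only matter for the double-space case, here is a deliberately simplified model of how a loader joins the physical lines of a double-quoted scalar. It is an assumption-laden sketch, not the ruamel.yaml scanner: the function name fold_double_quoted is hypothetical, and it handles only a trailing backslash and an escaped space, with no blank lines and no other escape sequences.

```python
def fold_double_quoted(lines):
    """Join the physical lines of a double-quoted scalar (text between the quotes).

    Simplified model: handles only a trailing backslash (escaped line break)
    and an escaped space; no blank lines, no other escapes.
    """
    parts = []
    last = len(lines) - 1
    for i, line in enumerate(lines):
        if i > 0:
            line = line.lstrip(" \t")  # indentation of continuation lines is dropped
        escaped_break = i < last and line.endswith("\\")
        if escaped_break:
            line = line[:-1]  # an escaped line break contributes nothing
        elif i < last:
            line = line.rstrip(" \t") + " "  # a plain line break folds to one space
        parts.append(line.replace("\\ ", " "))  # "\ " is a literal space
    return "".join(parts)


# Plain split: the line break alone stands in for the single space.
plain = fold_double_quoted(["This is a sample text", "    across two lines."])
assert plain == "This is a sample text across two lines."

# Double space, split without backslashes: one of the two spaces is lost.
lossy = fold_double_quoted(["This is a sample text ", "    across two lines."])
assert lossy == "This is a sample text across two lines."

# Double space, split with backslashes: both spaces survive.
safe = fold_double_quoted(["This is a sample text\\", "    \\  across two lines."])
assert safe == "This is a sample text  across two lines."
```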

    It doesn't look like the above has other side-effects, but this has not been further tested.

    You should take care when using allow_duplicate_keys: it will change your output if you have duplicate keys, and possibly not with the same semantics as another program loading the original document.

    You should also consider using the .yaml extension on files containing YAML documents, assuming the other programs using this document can handle that. That has been the recommended extension since at least September 2006, so I hope some others have updated their code since then.