python regex line-endings

Remove duplicate linebreaks in a string


I have some files that may use \r, \n, or \r\n as their line ending.

I am trying to change all of them to \r\n, and remove consecutive line breaks. In theory, this is easy, and any number of very simple regexes should work.
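As an illustration of the kind of "very simple regex" meant here, the following is a minimal sketch. The pattern and the names `LINE_BREAK_RUN` and `normalize_line_breaks` are assumptions made for this example; they are not necessarily the `reg_exp` used further down.

```python
import re

# Hypothetical pattern: one or more consecutive line breaks of any style.
# "\r\n" is listed first so a CRLF pair is consumed as a single break.
LINE_BREAK_RUN = re.compile(r'(?:\r\n|\r|\n)+')


def normalize_line_breaks(text):
    """Collapse each run of line breaks into a single CRLF."""
    return LINE_BREAK_RUN.sub('\r\n', text)


sample = '<ul>\r\n\r\n<li>one</li>\n\n\n<li>two</li>\r\r</ul>\r\n'
result = normalize_line_breaks(sample)
print(repr(result))
```

Note that this only collapses truly empty lines; lines containing spaces or tabs would survive unless the pattern also allowed whitespace between the breaks.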

In practice, though, running

text = re.sub(
    reg_exp,
    r'\r\n',
    text)

on this string (showing line-ending characters),

<ul>␍␊
␍␊
<li><a href="#">link</a></li>␍␊
␍␊
<li><a href="#">link</a></li>␍␊
<li><a href="#">link</a></li>␍␊
␍␊
<li><a href="#">link</a></li>␍␊
␍␊
</ul>␍␊

does not produce the result I expect, and I cannot figure out why.

Is my regex not matching the \r for some reason?


Solution

  • It turns out the problem occurred when Python wrote the string back to a file on Windows. It made some unexpected decisions about what to do with line endings. Specifically, it decided that:

    Both zmo's and Louis's answers work in the Python console, and so, it turns out, does the code in the question.

    For completeness, this is what the write() looked like:

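    with open(file_name, 'r+') as f:  # text mode, so newline translation applies
        text = f.read()

        # text = re.sub(...)

        f.seek(0)      # rewind so the write starts at the beginning of the file
        f.write(text)
        f.truncate()   # cut off anything left over past the newly written text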
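One way to avoid the problem, sketched here under the assumption that the code runs on Python 3 and that text-mode newline translation was the cause: on Windows, text mode writes every \n as \r\n, so a string that already contains \r\n ends up as \r\r\n on disk. Passing `newline=''` to `open()` turns that translation off in both directions. The helper name `rewrite_without_translation` and the temporary file are made up for the demonstration; the second half uses `newline='\r\n'` to simulate the Windows default on any platform.

```python
import os
import tempfile


def rewrite_without_translation(file_name, transform):
    """Read a file, apply transform, and write it back byte-for-byte."""
    # newline='' keeps \r, \n, and \r\n exactly as they are on read and write.
    with open(file_name, 'r+', newline='') as f:
        text = transform(f.read())
        f.seek(0)
        f.write(text)
        f.truncate()


# Set up a scratch file containing CRLF line endings.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, 'wb') as f:
    f.write(b'a\r\nb\r\n')

# With newline='', the CRLF endings survive the round trip unchanged.
rewrite_without_translation(path, lambda s: s)
with open(path, 'rb') as f:
    untranslated = f.read()

# Simulated Windows text mode: each \n is written as \r\n,
# so the existing \r\n turns into \r\r\n.
with open(path, 'w', newline='\r\n') as f:
    f.write('a\r\nb\r\n')
with open(path, 'rb') as f:
    translated = f.read()

os.remove(path)
print(untranslated, translated)
```

Opening the file in binary mode ('rb+') and working on bytes would be another option; `newline=''` keeps the rest of the code working with str.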