pythonpython-3.xunicodepython-unicode

Current idiom for removing 'surrogateescape' characters fron a decoded string


Armin Ronacher, http://lucumr.pocoo.org/2013/7/2/the-updated-guide-to-unicode/

If you for instance pass [the result of os.fsdecode() or equivalent] to a template engine you [sometimes get a UnicodeEncodeError] somewhere else entirely and because the encoding happens at a much later stage you no longer know why the string was incorrect. If you detect that error when it happens the issue becomes much easier to debug

Armin suggests a function

def remove_surrogate_escaping(s, method='ignore'):
    assert method in ('ignore', 'replace'), 'invalid removal method'
    return s.encode('utf-8', method).decode('utf-8')

Nick Coghlan, 2014, [Python-Dev] Cleaning up surrogate escaped strings

The current proposal on the issue tracker is to ... take advantage of the existing error handlers:

def convert_surrogateescape(data, errors='replace'):
    return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

That code is short, but semantically dense - it took a few iterations to come up with that version. (Added bonus: once you're alerted to the possibility, it's trivial to write your own version for existing Python 3 versions. The standard name just makes it easier to look up when you come across it in a piece of code, and provides the option of optimising it later if it ever seems worth the extra work)

The functions are slightly different. The second was written with knowledge of the first.

Since Python 3.5, the backslashreplace error handler now works on decoding as well as encoding. The first approach is not designed to use backslashreplace e.g. an error decoding the byte 0xff would get printed as "\udcff". The second approach is designed to solve this; it would print "\xff".

If you did not need backslashreplace, you might prefer the first version if you had the misfortune to be supporting Python < 3.5 (including polyglot 2/3 code, ouch).

Question

Is there a better idiom for this purpose yet? Or do we still use this drop-in function?


Solution

  • Nick referred to an issue for adding such a function to the codecs module. As of 2024, the function has not been added, and the ticket remains open.


    The latest comment says

    Nick Coghlan, 2018

    A recent discussion on python-ideas also introduced me to the third party library, "ftfy", which offers a wide range of tools for cleaning up improperly decoded data.

    That includes a lone surrogate fixer: ftfy.fixes.fix_surrogates(text)

    ...

    I do not find the function in ftfy appealing. The documentation does not say so, but it appears to be designed to handle both surrogateescape and ... be part of a workaround for CESU-8, or something like that?

    Replace 16-bit surrogate codepoints with the characters they represent (when properly paired), or with � otherwise.