Tags: python, utf-8, noncharacter

Strip noncharacters and invalid characters from UTF-8


I'm loading some data, processing it, then sending it to an application which (fair enough) doesn't allow the Unicode noncharacters U+FDD0 through U+FDEF, or the special code points U+FFFE and U+FFFF.

My raw data is out of my control, and some of it happens to contain invalid characters that I want to clean out.

However, my Python code is still sending the application these disallowed characters, because decoding and encoding don't ignore the noncharacters and other invalid characters.

For example, b'\xef\xbf\xbf'.decode('utf-8', 'ignore') returns '\uffff' instead of ignoring the character, and encode behaves the same way.
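A minimal sketch of the behaviour (assuming Python 3): the 'ignore' error handler only drops malformed byte sequences, while b'\xef\xbf\xbf' is a structurally well-formed encoding of U+FFFF, so it decodes unchanged:

```python
# 'ignore' removes bytes that are not valid UTF-8 at all...
malformed = b'abc\x80def'.decode('utf-8', 'ignore')
print(repr(malformed))  # 'abcdef' - the stray 0x80 byte is dropped

# ...but b'\xef\xbf\xbf' is well-formed UTF-8 for U+FFFF,
# so the noncharacter comes through untouched.
noncharacter = b'\xef\xbf\xbf'.decode('utf-8', 'ignore')
print(repr(noncharacter))  # '\uffff'
```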

I first debugged this with U+FFFE, which has a wontfix bug related to the BOM: https://bugs.python.org/issue765036

Then I found this long mailing-list thread (https://bugs.python.org/issue12729) arguing that it's OK to emit noncharacters because applications may want to keep them for internal use.

However, is there any nice Python way to emit "transmittable" UTF-8 without these noncharacters and other invalid characters like U+FFFF?


Solution

  • I haven't fully considered the ramifications of this, however, you could strip out those characters whose Unicode category is "Cn" ("other, not assigned"), which covers the noncharacters:

    >>> import unicodedata
    >>> s = '\uffff\ufffeSome string that contains \ufdd0, \ufdd1, \ufdef and \ufdf0'
    >>> print(s)
    Some string that contains ﷐, ﷑, ﷯ and ﷰ

    >>> s = ''.join(c for c in s if unicodedata.category(c) != 'Cn')
    >>> print(s)
    Some string that contains , ,  and ﷰ
    

    There is some information about character categories in the unicodedata module documentation and in the Unicode standard; see in particular the section "Restricted Interchange".
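    As a quick check (using the standard unicodedata module), the noncharacters above all report category "Cn", while U+FDF0 is an assigned letter, which is why it survived the filter:

```python
import unicodedata

# The noncharacters report 'Cn' ("other, not assigned")...
print(unicodedata.category('\ufdd0'))  # Cn
print(unicodedata.category('\ufffe'))  # Cn
print(unicodedata.category('\uffff'))  # Cn

# ...whereas U+FDF0 is an assigned letter ('Lo'), so the
# category filter keeps it.
print(unicodedata.category('\ufdf0'))  # Lo
```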

    It could be risky to strip out everything in category "Cn", though: code points that are merely unassigned today also fall into that category and may become assigned in future versions of the Unicode standard. You need to consider whether that is acceptable for your particular application, now and in the future.
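    If that risk is a concern, one alternative sketch (my own helper names, not part of the answer above) is to test for the 66 noncharacter code points explicitly rather than filtering on category "Cn", so genuinely unassigned code points are left alone:

```python
def is_noncharacter(c: str) -> bool:
    """True for the 66 Unicode noncharacters: U+FDD0..U+FDEF,
    plus the last two code points (xxFFFE/xxFFFF) of every plane."""
    cp = ord(c)
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFF) in (0xFFFE, 0xFFFF)

def strip_noncharacters(s: str) -> str:
    # Drop only noncharacters; unassigned-but-valid code points survive.
    return ''.join(c for c in s if not is_noncharacter(c))

s = '\uffff\ufffeSome string that contains \ufdd0, \ufdd1, \ufdef and \ufdf0'
print(strip_noncharacters(s))  # keeps the assigned U+FDF0, drops the rest
```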