python-3.x python-unicode

Why does emoji U+1F60A contain a new line character in UTF-16 when reading it?


We have a file that opens correctly in ordinary text editors such as Notepad++: the emoji is rendered and no extra new lines are added.

The problem is that when we open the same file with Python, the UTF-16 bytes of the emoji are split across two lines, which breaks our big data processing framework that reads the file in parallel.

We need to understand how Notepad++ knows that the sequence =\xd8\n\xde does not contain a real new line, so that we can adjust our custom file reader.


STEPS TO REPRODUCE


Solution

  • The UTF-8 bytes of U+1F60A are (in hexadecimal) f0 9f 98 8a. Note that this sequence contains the byte 8a, not the byte 0a (\n).

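    In UTF-16 the character is the surrogate pair d83d de0a. In big-endian byte order (UTF-16BE) the bytes are d8 3d de 0a.

    In little-endian byte order (UTF-16LE) each code unit is byte-swapped, so the bytes are 3d d8 0a de, which Python displays as =\xd8\n\xde.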

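    And here is the error: there is a byte 0a, but it is the low byte of the code unit de0a, not a line feed. In UTF-16LE a real new line is the code unit 000a, that is, the byte pair 0a 00 starting at an even offset. The file is being read with the wrong encoding (a single-byte encoding, or as raw bytes), so every 0a byte is treated as a line break. Decoding the file as UTF-16 before splitting it into lines, which is presumably what Notepad++ does, handles the 0a correctly.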
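
    The byte values above can be checked directly in Python. The following is only a minimal sketch: the sample string and the helper utf16le_newline_offsets are made up for illustration, and it assumes the data is UTF-16LE with no BOM and that any chunk handed to the helper starts at an even byte offset. It shows that splitting the raw bytes on 0a cuts the emoji in half, while decoding first (or looking only for aligned 0a 00 pairs) does not.

```python
emoji = "\U0001F60A"

utf8 = emoji.encode("utf-8")
utf16_be = emoji.encode("utf-16-be")
utf16_le = emoji.encode("utf-16-le")

print(utf8.hex())      # f09f988a
print(utf16_be.hex())  # d83dde0a
print(utf16_le.hex())  # 3dd80ade
print(utf16_le)        # b'=\xd8\n\xde'

# Hypothetical sample: one real new line, plus the emoji's embedded 0a byte.
data = f"a{emoji}b\nc".encode("utf-16-le")

# Byte-oriented splitting treats the emoji's 0a byte as a line break.
broken = data.split(b"\n")

# Decoding first, then splitting, sees only the real new line.
lines = data.decode("utf-16-le").split("\n")


def utf16le_newline_offsets(buf):
    """Return byte offsets of real U+000A code units in UTF-16LE data.

    Hypothetical helper: assumes buf starts on a code-unit boundary,
    so only even offsets are inspected.
    """
    return [i for i in range(0, len(buf) - 1, 2) if buf[i:i + 2] == b"\n\x00"]


print(len(broken))                    # 3 pieces: the emoji was cut in half
print(len(lines))                     # 2 lines
print(utf16le_newline_offsets(data))  # [8]
```

    For a whole file, open(path, encoding="utf-16") lets Python do the decoding before any line splitting (this codec expects a BOM to determine the byte order; without one it falls back to the platform's native order). A custom parallel reader would need to choose its chunk boundaries on even offsets for the aligned search to be valid.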