[SOLVED] Why does emoji U+1F60A contain a new line character in UTF-16 when reading it?

We have a file containing this emoji. When it is opened with an ordinary text editor such as Notepad++, the emoji renders correctly and no extra newlines appear.

However, when we open the same file in Python, the emoji's UTF-16 bytes are split across two lines, which breaks our Big Data Processing Framework that reads the file in parallel.

We need to understand how Notepad++ knows that the sequence =\xd8\n\xde does not contain a real newline, so that we can adjust our custom file reader.

STEPS TO REPRODUCE

Copy this emoji 😊 into an empty file and add a newline after it.

Save the file as UTF-16 and open it in Python in binary mode:

# Open the file as bytes:
with open("file_name.csv", "rb") as f:
    for line in f:
        print(line)

You will find what looks like an extra newline character in the middle of the emoji:
```
b'\xff\xfe=\xd8\n'
b'\xde\r\x00\n'
b'\x00'
```

Solution

In UTF-8, U+1F60A is encoded as the bytes (hexadecimal) f0 9f 98 8a. Note that the last byte is 8a, not 0a (\n).

In UTF-16 (big endian) it is a surrogate pair of two 16-bit code units: d83d de0a.

In UTF-16LE (little endian) the two bytes of each code unit are swapped: 3dd8 0ade.
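These byte sequences can be checked directly in Python. The following is a small sketch using only the built-in `str.encode` and `bytes.hex` (the two-argument form of `hex` needs Python 3.8 or later); the variable names are just for illustration.

```python
emoji = "\U0001F60A"

utf8 = emoji.encode("utf-8")
utf16_be = emoji.encode("utf-16-be")
utf16_le = emoji.encode("utf-16-le")

print(utf8.hex(" "))         # f0 9f 98 8a
print(utf16_be.hex(" ", 2))  # d83d de0a
print(utf16_le.hex(" ", 2))  # 3dd8 0ade

# For comparison: an actual newline character encoded as UTF-16LE.
newline_le = "\n".encode("utf-16-le")
print(newline_le.hex(" "))   # 0a 00
```

Both UTF-16 forms contain a byte 0a, while the UTF-8 form does not.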

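And here is the error: the UTF-16LE form does contain a byte 0a, but it is one half of the code unit de0a, not a newline. You are reading the file as raw bytes (or with a single-byte encoding), so every 0a byte is treated as a line break. In UTF-16 a newline is the 16-bit code unit 000a (bytes 0a 00 in little endian). Notepad++ decodes the file as UTF-16LE, most likely because of the byte order mark ff fe at the start, so it never mistakes the 0a inside the emoji for a newline. Read the file with the correct encoding, or make your custom reader split only on the pair 0a 00 at even byte offsets.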
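A minimal sketch of both approaches follows. It assumes the file is UTF-16LE with a BOM, as in the question. `split_utf16le_lines` is a hypothetical helper written for this answer, not part of any library, and it is not tuned for large files.

```python
import os
import tempfile

# The same bytes as in the question: BOM + emoji + "\r\n", all UTF-16LE.
raw = b"\xff\xfe" + "\U0001F60A\r\n".encode("utf-16-le")

# Option 1: let Python decode the file as text instead of iterating over bytes.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file_name.csv")
    with open(path, "wb") as f:
        f.write(raw)
    # The "utf-16" codec reads the BOM to pick the byte order and strips it.
    with open(path, encoding="utf-16") as f:
        text_lines = list(f)


# Option 2: stay at the byte level, but split only on aligned newline units.
def split_utf16le_lines(data):
    """Hypothetical helper: split UTF-16LE bytes after each 0a 00 pair
    that starts at an even offset."""
    lines = []
    start = 0
    for i in range(0, len(data) - 1, 2):
        if data[i:i + 2] == b"\n\x00":
            lines.append(data[start:i + 2])
            start = i + 2
    if start < len(data):
        lines.append(data[start:])  # trailing data without a newline
    return lines


byte_lines = split_utf16le_lines(raw)

print(raw.count(b"\n"))                  # number of 0a bytes seen naively
print(len(text_lines), len(byte_lines))  # number of lines found by each option
```

For a reader that processes the file in parallel, the same idea would presumably apply to chunk boundaries: each chunk has to start at an even byte offset, otherwise the two-byte alignment is lost and 0a 00 pairs can be misdetected.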