I have the string 'abç', which in UTF-8 is b'ab\xc3\xa7'.
I want it in UTF-16, but not this way:
b'ab\xc3\xa7'.decode('utf-8').encode('utf-16-be')
which gives me:
b'\x00a\x00b\x00\xe7'
The answer I want is the UTF-16 code units, that is, a list of int:
[97, 98, 231]
Is there any straightforward way to do that?
And of course, the reverse. Given a list of ints which are UTF-16 code units, how do I convert that to UTF-8?
The simple solution that may work in many cases would be something like:
def sort_of_get_utf16_code_units(s):
    # ord() returns the code point, which equals the UTF-16 code unit only within the BMP
    return list(map(ord, s))
print(sort_of_get_utf16_code_units('abç'))
Output:
[97, 98, 231]
However, that doesn't work for characters outside the Basic Multilingual Plane (BMP):
print(sort_of_get_utf16_code_units('😊'))
Output is the Unicode code point:
[128522]
Where you might have expected the code units (as your question states):
[55357, 56842]
To get that:
def get_utf16_code_units(s):
    utf16_bytes = s.encode('utf-16-be')
    # Each UTF-16 code unit is 2 bytes; read them as big-endian integers
    return [int.from_bytes(utf16_bytes[i:i+2], byteorder='big') for i in range(0, len(utf16_bytes), 2)]
print(get_utf16_code_units('😊'))
Output:
[55357, 56842]
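To see where those two values come from, here is a minimal sketch of the surrogate-pair arithmetic that the 'utf-16-be' codec performs internally for code points above U+FFFF. The function name to_surrogate_pair is made up for illustration; in practice the encode-based version above is all you need.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go into the low surrogate
    return [high, low]

print(to_surrogate_pair(ord('😊')))  # [55357, 56842]
```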
Doing the reverse is similar:
def utf16_code_units_to_string(code_units):
    utf16_bytes = b''.join([unit.to_bytes(2, byteorder='big') for unit in code_units])
    return utf16_bytes.decode('utf-16-be')
print(utf16_code_units_to_string([55357, 56842]))
Output:
😊
The byteorder argument defaults to 'big' as of Python 3.11, but it doesn't hurt to be specific there; on earlier versions it is required.
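Since the question asks for UTF-8 rather than a str, one more step is needed: encode the decoded string. A minimal sketch, assuming the input is a well-formed sequence of code units (the function name utf16_code_units_to_utf8 is my own, not a standard one):

```python
def utf16_code_units_to_utf8(code_units):
    """Convert a list of UTF-16 code units (ints) to UTF-8 bytes.

    Hypothetical helper; assumes every surrogate in the input is properly paired.
    """
    # Pack each 16-bit code unit as two big-endian bytes
    utf16_bytes = b''.join(unit.to_bytes(2, byteorder='big') for unit in code_units)
    # Decode to str, then re-encode as UTF-8
    return utf16_bytes.decode('utf-16-be').encode('utf-8')

print(utf16_code_units_to_utf8([97, 98, 231]))   # b'ab\xc3\xa7'
print(utf16_code_units_to_utf8([55357, 56842]))  # b'\xf0\x9f\x98\x8a'
```

If the list contains an unpaired surrogate, the decode step should raise UnicodeDecodeError, so you may want to catch that if the input is untrusted.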