pythonpython-3.xunicodesurrogate-pairs

How to convert surrogate pairs into hexadecimal, and vice-versa in Python?


How would I convert characters which are surrogate pairs into hexadecimal?

I've found that using hex() and ord() works for characters with a single code point, such as emojis like "šŸ˜€". For example:

print(hex(ord("šŸ˜€")))
# '0x1f600'

Similarly, using chr() and int() works for getting the characters from the hexadecimal:

print(chr(int(0x1f600)))
# 'šŸ˜€'

However, as soon as I use a surrogate pair, such as an emoji like "šŸ‘©šŸ»", the code throws an error:

print(hex(ord("šŸ‘©šŸ»")))
TypeError: ord() expected a character, but string of length 2 found

How would I fix this, and how would I convert such hexadecimal back into a character?


Solution

  • Since an exact output format wasn't specified, how about:

    def hexify(s):
        return s.encode('utf-32-be').hex(sep=' ', bytes_per_sep=4)
    
    def unhexify(s):
        return bytes.fromhex(s).decode('utf-32-be')
    
    s = hexify('šŸ‘©šŸ»')
    print(s)
    print(unhexify(s))
    

    Output:

    0001f469 0001f3fb
    šŸ‘©šŸ»
    

    Or similar to your original code:

    def hexify(s):
        return [hex(ord(c)) for c in s]
    
    def unhexify(L):
        return ''.join([chr(int(n,16)) for n in L])
    
    s = hexify('šŸ‘©šŸ»')
    print(s)
    print(unhexify(s))
    

    Output:

    ['0x1f469', '0x1f3fb']
    šŸ‘©šŸ»