python unicode utf-8 character-encoding hindi

Python 3: Converting a string of UTF-8 Hindi byte escapes to Unicode


I have a string of UTF-8 byte escapes

'\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2' which converts to

ही बोल in Hindi. I am unable to convert the string a to the correct bytes:

a = '\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2'
#convert a to bytes
#also tried a = bytes(a,'utf-8')
a = a.encode('utf-8')
s = str(a,'utf-8')

The string is converted to bytes, but the bytes are wrong:

RESULT : b'\xc3\xa0\xc2\xa4\xc2\xb9\xc3\xa0\xc2\xa5\xc2\x80 \xc3\xa0\xc2\xa4\xc2\xac\xc3\xa0\xc2\xa5\xc2\x8b\xc3\xa0\xc2\xa4\xc2\xb2' which prints - हॠबà¥à¤²

EXPECTED : It should be b'\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2' which will be ही बोल


Solution

  • Use the raw-unicode-escape codec to encode the string as bytes, then you can decode as UTF-8.

    >>> s = '\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2'
    >>> s.encode('raw-unicode-escape').decode('utf-8')
    'ही बोल'
    

    This is something of a workaround; the ideal solution would be to prevent the source of the data from stringifying the original bytes in the first place.
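
To see why the original attempt fails, it may help to compare the bytes each codec produces. The sketch below assumes every character in the string is in the range U+0000 to U+00FF (true for the example in the question); under that assumption the latin-1 codec gives the same bytes as raw-unicode-escape. The helper name fix_mojibake is hypothetical and used only for illustration.

```python
# The str holds one code point per original byte (U+00E0, U+00A4, ...).
s = '\xe0\xa4\xb9\xe0\xa5\x80 \xe0\xa4\xac\xe0\xa5\x8b\xe0\xa4\xb2'

# UTF-8 encodes each code point in U+0080..U+07FF as two bytes,
# so every original byte gets doubled: U+00E0 becomes b'\xc3\xa0'.
double_encoded = s.encode('utf-8')

# latin-1 maps U+0000..U+00FF one-to-one onto bytes 0x00..0xFF,
# which recovers the original UTF-8 byte sequence unchanged.
original_bytes = s.encode('latin-1')


def fix_mojibake(text):  # hypothetical helper name
    """Reinterpret a str of byte-valued code points as UTF-8 text."""
    return text.encode('latin-1').decode('utf-8')


fixed = fix_mojibake(s)
print(double_encoded[:2])  # the first character, encoded as two bytes
print(original_bytes)
print(fixed)
```

One difference worth noting: latin-1 raises UnicodeEncodeError if the string contains any character above U+00FF, while raw-unicode-escape writes such characters as \uXXXX escape sequences, which the UTF-8 decode step then leaves in the result as literal text. Which behaviour is preferable depends on whether mixed input should fail loudly.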