
Convert UTF-8 to string literals in Python


I have a UTF-8 encoded string, but I'm not sure how to convert it to its corresponding character literal. For example, I have this string:

'Entre\xc3\xa9'

Example one:

This code:

u'Entre\xc3\xa9'.encode('latin-1').decode('utf-8')

returns the result: u'Entre\xe9'

If I then continue by printing this:

print u'Entre\xe9'

I get the result: Entreé

This is great and close to what I need. The problem is that I can't put 'Entre\xc3\xa9' in a variable and pass it through the same steps, because that breaks. Any tips for getting this working?

Example:

a = 'Entre\xc3\xa9'
b = 'u'+ a.encode('latin-1').decode('utf-8')
c= 'u'+ b

I would like the result of c to be:

Entreé

Solution

  • The u'' syntax only works for string literals, e.g. defining values in source code. Using the syntax results in a unicode object being created, but that's not the only way to create such an object.

    You cannot make a unicode value from a byte string by adding u in front of it. But if you call str.decode() with the right encoding, you get a unicode value. Conversely, you can encode unicode objects to byte strings with unicode.encode().
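    As a minimal sketch of that round trip, the snippet below decodes the bytes from the question and encodes them back. The variable names (raw, text, round_trip) are illustrative only, and the explicit b'' and u'' prefixes are an assumption made so the same lines run on Python 2.7 and Python 3.3+.

```python
# Byte string holding UTF-8 encoded data: 7 bytes, the last two encode one character.
raw = b'Entre\xc3\xa9'

# Bytes -> Unicode text. The two bytes \xc3\xa9 become the single code point U+00E9.
text = raw.decode('utf-8')
assert text == u'Entre\xe9'
assert len(raw) == 7 and len(text) == 6

# Unicode text -> bytes, using the same encoding, gives back the original bytes.
round_trip = text.encode('utf-8')
assert round_trip == raw
```

    Decoding is a conversion step performed on a value at run time, which is why concatenating the letter 'u' onto a string cannot stand in for it.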

    Note that when displaying a unicode object, Python represents it using the Unicode string literal syntax again (so u'...'), to ease debugging. You can paste the representation back into a Python interpreter and get an object with the same value.
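    The pasting-back property can be checked programmatically. This is a sketch only: it uses ast.literal_eval from the standard library as a stand-in for pasting the text into an interpreter, and the names value, shown and rebuilt are illustrative. Note that the exact representation differs by version: Python 2 shows a u'...' literal, while Python 3 shows a plain '...' literal.

```python
import ast

value = b'Entre\xc3\xa9'.decode('utf-8')

# repr() produces the literal-syntax text that the interpreter displays.
shown = repr(value)

# Parsing that text as a literal yields an object with the same value.
rebuilt = ast.literal_eval(shown)
assert rebuilt == value
```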

    Your a value is defined using a byte string literal, so you only need to decode:

    a = 'Entre\xc3\xa9'
    b = a.decode('utf8')
    

    Your first example created a Mojibake, a Unicode string containing Latin-1 codepoints that actually represent UTF-8 bytes. This is why you had to encode to Latin-1 first (to undo the Mojibake), then decode from UTF-8.
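    To make the Mojibake concrete, here is a small sketch that produces one deliberately and then undoes it the same way the first example did. The variable names are illustrative, and the u'' and b'' prefixes are an assumption made so the code runs on Python 2.7 and Python 3.3+.

```python
# Correct text, encoded to UTF-8 bytes.
utf8_bytes = u'Entre\xe9'.encode('utf-8')
assert utf8_bytes == b'Entre\xc3\xa9'

# Decoding with the wrong codec maps each byte to one code point: this is the Mojibake.
mojibake = utf8_bytes.decode('latin-1')
assert mojibake == u'Entre\xc3\xa9'
assert len(mojibake) == 7

# Encoding to Latin-1 recovers the original bytes; decoding as UTF-8 then gives the real text.
repaired = mojibake.encode('latin-1').decode('utf-8')
assert repaired == u'Entre\xe9'
assert len(repaired) == 6
```

    Latin-1 works for the undo step because it maps code points U+0000 to U+00FF one-to-one onto byte values 0 to 255, so no information is lost in either direction.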

    You may want to read up on Python and Unicode in the Unicode HOWTO. Other articles of interest are: