[SOLVED] Python UTF-8 Latin-1 displays wrong character

Python UTF-8 Latin-1 displays wrong character

I'm writing a very small script that can convert latin-1 characters into unicode (I'm a complete beginner in Python).

I tried a method like this:

def latin1_to_unicode(character):

    uni = character.decode('latin-1').encode("utf-8")
    retutn uni

It works fine for characters that are not specific to the latin-1 set, but if I try the following example:

print latin1_to_Unicode('å')

It returns Ã¥ instead of å. Same goes for other letters like æ and ø.

Can anyone please explain why this is happening? Thanks

I have the # -*- coding: utf8 -*- declaration in my script, if it matters any to the problem

Solution

Your source code is encoded to UTF-8, but you are decoding the data as Latin-1. Don't do that, you are creating a Mojibake.

Decode from UTF-8 instead, and don't encode again. print will write to sys.stdout which will have been configured with your terminal or console codec (detected when Python starts).

My terminal is configured for UTF-8, so when I enter the å character in my terminal, UTF-8 data is produced:

>>> 'å'
'\xc3\xa5'
>>> 'å'.decode('latin1')
u'\xc3\xa5'
>>> print 'å'.decode('latin1')
Ã¥

You can see that the character uses two bytes; when saving your Python source with an editor configured to use UTF-8, Python reads the exact same bytes from disk to put into your bytestring.

Decoding those two bytes as Latin-1 produces two Unicode codepoints corresponding to the Latin-1 codec.

You probably want to do some studying on the difference between Unicode and encodings, and how that relates to Python:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
The Python Unicode HOWTO