I'm writing a very small script that can convert latin-1 characters into unicode (I'm a complete beginner in Python).
I tried a method like this:
def latin1_to_unicode(character):
uni = character.decode('latin-1').encode("utf-8")
retutn uni
It works fine for characters that are not specific to the latin-1 set, but if I try the following example:
print latin1_to_Unicode('å')
It returns å
instead of å
. Same goes for other letters like æ
and ø
.
Can anyone please explain why this is happening? Thanks
I have the # -*- coding: utf8 -*-
declaration in my script, if it matters any to the problem
Your source code is encoded to UTF-8, but you are decoding the data as Latin-1. Don't do that, you are creating a Mojibake.
Decode from UTF-8 instead, and don't encode again. print
will write to sys.stdout
which will have been configured with your terminal or console codec (detected when Python starts).
My terminal is configured for UTF-8, so when I enter the å
character in my terminal, UTF-8 data is produced:
>>> 'å'
'\xc3\xa5'
>>> 'å'.decode('latin1')
u'\xc3\xa5'
>>> print 'å'.decode('latin1')
Ã¥
You can see that the character uses two bytes; when saving your Python source with an editor configured to use UTF-8, Python reads the exact same bytes from disk to put into your bytestring.
Decoding those two bytes as Latin-1 produces two Unicode codepoints corresponding to the Latin-1 codec.
You probably want to do some studying on the difference between Unicode and encodings, and how that relates to Python: