python encoding utf-8 iso-8859-1 mojibake

Encoding Issue with data from MySQL database


I have a SQL database that has encoding issues, so it returns results similar to this:

"Cuvée"

From what I can tell, this is because the text was decoded as Latin-1 when it should have been decoded as UTF-8 (please correct me if I'm wrong). I'm processing these results in a Python script and have been unable to convert the text back to what it's supposed to be:

"Cuvée"

I'm using Python 3.3, but when I use codecs.decode to convert from latin1 to utf-8 I get:

'str' does not support the buffer interface

I think I've tried everything I found to no avail. I'm not really keen on going to Python 2.7 because I've written the rest of the script on 3.3 and it will be quite a pain to rewrite. Is there a way to do this that I am unaware of?


Solution

  • Yes, you have what is called Mojibake: UTF-8 bytes that were decoded with the wrong codec. That codec could be Latin-1, Windows codepage 1252 (CP1252), or another closely related one.

    You could just try to encode as Latin-1, then decode again:

    faulty_text.encode('latin1').decode('utf8')
    

    However, sometimes, especially with CP1252 Mojibake, the faulty decoding produces text that cannot legally be encoded back to bytes, because some UTF-8 bytes were forcefully 'decoded' even though the codec has no mapping for them (CP1252 leaves bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined).

    Your best bet is to install the ftfy library, which can automatically fix such Mojibake mistakes for you. It includes special codecs to undo CP1252 Mojibakes properly (as well as other related codepages), codecs that bypass the aforementioned problems.
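
To make the round trip and its failure mode concrete, here is a minimal standard-library sketch. The helper name undo_mojibake and its candidate list are hypothetical, not part of Python or ftfy; it simply tries each suspect codec in turn and leaves the text alone when none of them round-trips cleanly.

```python
def undo_mojibake(text, candidates=("latin1", "cp1252")):
    """Try to reverse UTF-8 text that was wrongly decoded with one of `candidates`.

    Hypothetical helper for illustration: returns the repaired string, or the
    original string unchanged if no candidate codec round-trips.
    """
    for codec in candidates:
        try:
            # Recover the original bytes, then decode them as UTF-8.
            return text.encode(codec).decode("utf8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Either the text has characters this codec cannot encode,
            # or the recovered bytes are not valid UTF-8; try the next one.
            continue
    return text


# Latin-1 round trip works for the example from the question.
fixed = undo_mojibake("Cuvée")

# "€" and "™" do not exist in Latin-1, so this one only works via CP1252.
apostrophe = undo_mojibake("itâ€™s")

# Byte 0x81 is undefined in CP1252 and "€" is missing from Latin-1, so a
# string containing both cannot be encoded with either codec and is returned as is.
stuck = undo_mojibake("Ã\x81 â€™")
```

The last case is the situation described above: neither plain codec can reproduce the original bytes, which is where a dedicated tool such as ftfy (for example its fix_text function) is the more robust choice.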