pythonmysqlutf-8asciilatin

Character encoding and decoding in Python with MySQL


For query:

SHOW VARIABLES LIKE 'char%';

MySQL Database returns:

character_set_client    latin1
character_set_connection    latin1
character_set_database  latin1
character_set_filesystem    binary
character_set_results   latin1
character_set_server    latin1
character_set_system    utf8
character_sets_dir  /usr/local/mysql-5.7.27-macos10.14-x86_64/share/charsets/

In my Python script:

conn = get_database_connection()
conn.setdecoding(pyodbc.SQL_CHAR, encoding='latin1')
conn.setdecoding(pyodbc.SQL_WCHAR, encoding='latin1')

For one of the columns that has following value:

N’a pas

Python returns:

N?a pas

Between N and a, There is a star shaped question-mark. How do I read it as is? What's the best way to handle it? I have been reading about converting my db to utf-8 but that seems like a long shot with a good chance of breaking other things. Is there a more efficient way to do it?

At some of the places in code, I have done :

value = value.encode('utf-8', 'ignore').decode('utf-8')

to handle utf-8 data like accented characters but apostrophe did not get handled with the same and I ended up with ? instead of '


Solution

  • Converting the database to UTF-8 is better for the long run, but risky because you may break other things like you say. What you can do is change the database connection encoding to UTF-8. That way you get UTF-8 encoded strings out of the database, without having changed how the data is actually stored.

    conn.setdecoding(pyodbc.SQL_CHAR, encoding='utf8')
    conn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf8')
    

    If that seems too risky, but you could consider having two separate database connections, the original and one in utf8, and migrate the app to using utf8 little by little, as you have time to test.

    If even that seems too risky, maybe try using a character encoding that's more similar to mysql's version of latin1. MySQL's "latin1" is actually an extended version of cp1252 encoding, which itself is a Microsoft extension of the "standard latin1" that's used in Python (among others).

    conn.setdecoding(pyodbc.SQL_CHAR, encoding='cp1252')
    conn.setdecoding(pyodbc.SQL_WCHAR, encoding='cp1252')