For query:
SHOW VARIABLES LIKE 'char%';
MySQL Database returns:
character_set_client latin1
character_set_connection latin1
character_set_database latin1
character_set_filesystem binary
character_set_results latin1
character_set_server latin1
character_set_system utf8
character_sets_dir /usr/local/mysql-5.7.27-macos10.14-x86_64/share/charsets/
In my Python script:
conn = get_database_connection()
conn.setdecoding(pyodbc.SQL_CHAR, encoding='latin1')
conn.setdecoding(pyodbc.SQL_WCHAR, encoding='latin1')
For one of the columns that has following value:
N’a pas
Python returns:
N?a pas
Between N and a, There is a star shaped question-mark. How do I read it as is? What's the best way to handle it? I have been reading about converting my db to utf-8
but that seems like a long shot with a good chance of breaking other things. Is there a more efficient way to do it?
At some of the places in code, I have done :
value = value.encode('utf-8', 'ignore').decode('utf-8')
to handle utf-8
data like accented characters but apostrophe
did not get handled with the same and I ended up with ?
instead of '
Converting the database to UTF-8 is better for the long run, but risky because you may break other things like you say. What you can do is change the database connection encoding to UTF-8. That way you get UTF-8 encoded strings out of the database, without having changed how the data is actually stored.
conn.setdecoding(pyodbc.SQL_CHAR, encoding='utf8')
conn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf8')
If that seems too risky, but you could consider having two separate database connections, the original and one in utf8
, and migrate the app to using utf8
little by little, as you have time to test.
If even that seems too risky, maybe try using a character encoding that's more similar to mysql's version of latin1. MySQL's "latin1" is actually an extended version of cp1252 encoding, which itself is a Microsoft extension of the "standard latin1" that's used in Python (among others).
conn.setdecoding(pyodbc.SQL_CHAR, encoding='cp1252')
conn.setdecoding(pyodbc.SQL_WCHAR, encoding='cp1252')