
CSV created with Python 3.9.x containing non-English (Unicode) characters, UTF-8 encoded, does not display correctly when opened in Excel on Windows


My original Python 2.7 code, which created the CSV file with non-English characters, used this discouraged hack:

import sys
reload(sys)
sys.setdefaultencoding('utf8')

This changed the default encoding from ASCII to UTF-8.

In addition, I wrote the UTF-8 BOM (byte order mark) at the start of the file so that Excel would open it as UTF-8:

fp.write("\xEF\xBB\xBF")

And it worked great: all non-English characters were displayed perfectly in Excel on Windows.

This was the CSV creation code I used (rows is the list of rows returned by an SQL query):

import csv

filename = "example.csv"
fp = open(filename, 'w')
fp.write("\xEF\xBB\xBF")  # UTF-8 BOM; in Python 2 this str is three raw bytes
myFile = csv.writer(fp)
myFile.writerows(rows)
fp.close()

Now, when I moved to Python 3.9.x (on Raspbian Bullseye), that hack no longer worked, for several reasons which I can elaborate on if needed (for one, sys.setdefaultencoding() does not exist in Python 3). What surprised me most is that the default encoding in Python 3.9.x is already UTF-8, so the hack was no longer needed.

BTW, to check the default encoding, run the following command in the terminal:

python -c "import sys; print(sys.getdefaultencoding())"
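
Note that sys.getdefaultencoding() and the encoding open() picks when none is passed are two different settings. The sketch below prints both; it assumes Python 3.9 as in this post, where the open() default follows the locale. The variable names are only illustrative.

```python
import locale
import sys

# Encoding used by str.encode() / bytes.decode() when none is given.
# This is "utf-8" on Python 3.
default_codec = sys.getdefaultencoding()

# Encoding that open() uses in text mode when no encoding= argument is passed.
# On Python 3.9 this follows the locale, so it can differ between machines.
open_default = locale.getpreferredencoding(False)

print("str/bytes default:", default_codec)
print("open() default:   ", open_default)
```

Passing encoding= explicitly to open() avoids depending on either value.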

The CSV was created but weird characters were displayed in Windows (Excel).

I tried removing the BOM "\xEF\xBB\xBF" from the start of the file (I figured it was no longer needed because of the default UTF-8 encoding) and thought all would be good. It wasn't: I still got weird characters when opening the CSV in Excel on Windows.
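
A likely reason the hand-written BOM stopped working: in Python 3 the literal "\xEF\xBB\xBF" is a text string of three characters, not three bytes, so a UTF-8 text file receives six different bytes. A minimal sketch (variable names are illustrative):

```python
import codecs

# In Python 3 this is a str holding the characters U+00EF, U+00BB, U+00BF.
manual_bom = "\xEF\xBB\xBF"

# Writing it to a UTF-8 text file encodes each character separately.
as_written = manual_bom.encode("utf-8")

print(as_written)                     # b'\xc3\xaf\xc2\xbb\xc2\xbf'
print(codecs.BOM_UTF8)                # b'\xef\xbb\xbf'
print(as_written == codecs.BOM_UTF8)  # False

# The BOM as a text character is U+FEFF, which does encode to the right bytes.
print("\ufeff".encode("utf-8") == codecs.BOM_UTF8)  # True
```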


Solution

  • After extensive research and a lot of trial and error, I found the answer.

    1. Removing the hand-written "\xEF\xBB\xBF" was correct: in Python 3 that string is three text characters, not the three BOM bytes, so writing it by hand is not the way to go.
    2. The missing bit was passing the RIGHT encoding to the open() call: I was missing encoding="utf-8-sig".

    Code that solved the issue:

    fp = open(filename, 'w', encoding="utf-8-sig")
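
For completeness, here is a sketch of the whole Python 3 version of the writer. The rows list and the temporary file location are made-up stand-ins for the SQL result and path in the post; newline="" is what the csv module documentation recommends when opening a file for csv.writer.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for the rows returned by the SQL query.
rows = [["id", "name"], [1, "שלום"], [2, "Zoë"]]

# Written to the temp directory here only to keep the example self-contained.
filename = os.path.join(tempfile.gettempdir(), "example.csv")

# utf-8-sig writes the BOM automatically, so no manual fp.write() is needed.
with open(filename, "w", encoding="utf-8-sig", newline="") as fp:
    csv.writer(fp).writerows(rows)

# Check the first three bytes of the file: they are the UTF-8 BOM.
with open(filename, "rb") as fp:
    head = fp.read(3)
print(head)  # b'\xef\xbb\xbf'
```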
    

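    Plain encoding="utf-8" didn't work: it writes no BOM, and without one Excel apparently does not treat the file as UTF-8.

    encoding="utf-8-sig" (UTF-8 with signature) did the trick. When writing, this codec automatically puts the BOM bytes EF BB BF at the start of the file, which is what Excel uses to recognize the file as UTF-8. When reading, the same codec skips a leading BOM if one is present. So the BOM is still needed in the file; it just has to be written by the codec instead of by hand.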
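
The difference between the two codecs can be checked directly on a short string. This is only an illustration; the sample text is arbitrary.

```python
text = "café"

with_bom = text.encode("utf-8-sig")  # BOM is prepended when encoding
without_bom = text.encode("utf-8")   # no BOM

print(with_bom)     # b'\xef\xbb\xbfcaf\xc3\xa9'
print(without_bom)  # b'caf\xc3\xa9'

# When decoding, utf-8-sig strips a leading BOM; plain utf-8 keeps it as U+FEFF.
decoded_sig = with_bom.decode("utf-8-sig")
decoded_plain = with_bom.decode("utf-8")

print(repr(decoded_sig))    # 'café'
print(repr(decoded_plain))  # '\ufeffcafé'
```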

    Hope it will help someone :)