pythonutf-8character-encodingstring-decoding

print (encoding/edcoding) French characters works in txt file but incorrect in excel/csv [python]


I was given a string with special characters (which should be french characters) and want to make it display correctly in csv/excel:

s1 = 'Benoît'

# take a look at encoding
print(s1.encode(encoding='utf-8'))

# print to txt
with open("firstname.txt", "w") as text_file:
    print(s1, file=text_file)

# print to csv
import pandas as pd
df = pd.DataFrame({'FirstName': [s1]})
df.to_csv('firstname.csv', index = False)

The results txt file displays the French correctly, but the csv doesn't.

enter image description here

My question is how to made csv display it correctly? (I can copy the French character from txt to csv, but how to programmatically write the csv and display it correctly?)


UPDATE: thanks to @snakecharmerb and i tried encoding = 'utf-8-sig'

# try csv with encoding = 'utf-8-sig': doesn't work
df = pd.DataFrame({'a': [s1]})
df.to_csv('firstname.csv', index = False, encoding = 'utf-8-sig')



# read from txt file which seems to display correctly
df = pd.read_table("firstname.txt", header = None)
df
#         0
# 0  Benoît

# then write to csv with encoding = 'utf-8-sig' - works
df.to_csv('firstname1.csv', index = False, encoding = 'utf-8-sig')

enter image description here


Solution

  • Excel will not necessarily recognise that the file is encoded as UTF-8. You can specify UTF-8 as the encoding when opening the file in Excel, or you can write the csv file with the 'utf-8-sig' encoding.

    'utf-8-sig' is a Windows-specific version of Excel which inserts a a three character "byte order mark" (BOM) and the beginning of the file. Windows applications trying to guess a file's encoding will read the BOM and decode the file from UTF-8. The BOM may not be recognised on other platforms, resulting in three unusual characters appearing at the beginning of the file.