python csv utf-8 facebook-messenger

Trouble with writing to a csv using utf-8 encoding


I'm trying to analyse some Facebook Messenger data and I'm having trouble with UTF-8 encoding.

import csv
import json
import os
from datetime import datetime

from tqdm import tqdm

directory = "facebook-100071636101603/messages/inbox"
folders = os.listdir(directory)

if ".DS_Store" in folders:
    folders.remove(".DS_Store")

for folder in tqdm(folders):
    print(folder)
    for filename in os.listdir(os.path.join(directory,folder)):
        if filename.startswith("message"):
            data = json.load(open(os.path.join(directory,folder,filename), "r"))
            for message in data["messages"]:
                try:
                    date = datetime.fromtimestamp(message["timestamp_ms"] / 1000).strftime("%Y-%m-%d %H:%M:%S")
                    sender = message["sender_name"]
                    content = message["content"]
                    with open('output.csv', 'w', encoding="utf-8") as csv_file:
                        writer = csv.writer(csv_file)
                        writer.writerow([date,sender,content])

                except KeyError:
                    pass

This script works, but the output CSV doesn't show the accented characters correctly.

I'm very new to this so I haven't tried much. I've read the Python csv documentation and found this passage:

Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getencoding()). To decode a file using a different encoding, use the encoding argument of open:

import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

But this doesn't seem to work.

Edit: The output I'm getting shows JÃ¸rn where it should be Jørn, and quÃªte where it should be quête.


Solution

  • Try adding encoding="utf-8" to this line:

    json.load(open(os.path.join(directory,folder,filename), "r", encoding="utf-8"))
    

    This ensures that every file you load is decoded as UTF-8 instead of the system default encoding.

    EDIT:

    If the text is still garbled, install ftfy with pip install ftfy. This package repairs broken encoding. Pass sender and content through ftfy.fix_text like this:

    import ftfy
    # Your other code
    sender = message["sender_name"]
    content = message["content"]
    sender = ftfy.fix_text(sender)
    content = ftfy.fix_text(content)
    

    You can use ftfy.fix_text(string) for any other broken encoding as well.
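If you would rather not add a dependency, the same repair can be sketched with the standard library alone. This assumes the garbling is the usual UTF-8-read-as-Latin-1 pattern (which matches the JÃ¸rn example in the question, but is not confirmed for every message). The helper names fix_mojibake and write_rows are hypothetical, not part of any library. The sketch also opens the output file once, because opening it with mode 'w' for every message keeps only the last row.

```python
import csv


def fix_mojibake(text):
    """Undo UTF-8 text that was mis-decoded as Latin-1 (hypothetical helper)."""
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of damage (or already correct): leave it unchanged.
        return text


def write_rows(rows, path="output.csv"):
    """Write every (date, sender, content) row in a single pass."""
    # The file is opened once, so earlier rows are not overwritten.
    # newline="" is what the csv documentation recommends for csv files.
    with open(path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        for date, sender, content in rows:
            writer.writerow([date, fix_mojibake(sender), fix_mojibake(content)])


# "J\u00c3\u00b8rn" is the garbled form of "J\u00f8rn" (Jørn).
print(fix_mojibake("J\u00c3\u00b8rn") == "J\u00f8rn")  # True
```

In the original loop you would collect [date, sender, content] into a list and call write_rows on it after the loop. ftfy handles more kinds of damage than this round trip does, so it remains the safer choice if the data is mixed.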