Tags: python, email, character-encoding, shift-jis

How to detect and correct the Content-Type charset in an email header in Python?


What is the correct way to programmatically detect and correct the Content-Type charset in an email header in Python?

I have thousands of emails extracted to .eml (basically plain-text) files. Some are encoded in shift_jis, but the Content-Type header doesn't declare a charset, so they don't display correctly in any email program. Adding the charset manually to the Content-Type header fixes this.

Was:

Content-Type: text/plain; format=flowed

Needs to be:

Content-Type: text/plain; charset="shift_jis"; format=flowed

What's the correct way to do this in Python while preserving the email body and the other headers?

Also, is there a way to detect the encoding and correct only the messages that use it? I can't just convert everything blindly, since some are iso_2022_jp, and those already display correctly.


Solution

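  • With get_charsets() you can get the charsets a message already declares, one entry per part (None where a part declares no charset). Here's a sample:

    from email import message_from_file

    # Parse the .eml file; the with block closes the file afterwards.
    with open('path.eml') as f:
        msg = message_from_file(f)

    msg.get_charsets()
    # e.g. [None, 'gb2312', None]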
    

    With this approach you can loop through all the messages and use set_charset() to set the correct charset on those that don't declare one.
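The loop described above could be sketched roughly as follows. This is a minimal sketch under a few assumptions, not a definitive implementation. The helper names (looks_like_shift_jis, add_missing_charset, fix_eml_file) are made up for illustration. The detection is only a heuristic: a text part is treated as shift_jis if it declares no charset, contains 8-bit bytes, is not valid UTF-8, and decodes cleanly as shift_jis. iso_2022_jp bodies are 7-bit, so they are left alone, but other 8-bit encodings could still be misdetected. The sketch also uses set_param('charset', ...) rather than set_charset(), because set_charset() may re-encode the payload, while set_param() only touches the header. The files are read and written as bytes so the body is not decoded along the way.

```python
from email import message_from_bytes
from pathlib import Path


def looks_like_shift_jis(payload: bytes) -> bool:
    """Heuristic guess: 8-bit data that decodes cleanly as Shift_JIS."""
    # Pure 7-bit data (plain ASCII or ISO-2022-JP) and valid UTF-8 are left alone.
    for encoding in ("ascii", "utf-8"):
        try:
            payload.decode(encoding)
            return False
        except UnicodeDecodeError:
            pass
    try:
        payload.decode("shift_jis")
    except UnicodeDecodeError:
        return False
    return True


def add_missing_charset(msg, charset: str = "shift_jis") -> bool:
    """Add a charset parameter to text parts that lack one and look like Shift_JIS.

    Returns True if any part was changed.
    """
    changed = False
    for part in msg.walk():
        # Skip multipart containers, attachments, and parts that already declare a charset.
        if part.get_content_maintype() != "text":
            continue
        if part.get_param("charset") is not None:
            continue
        payload = part.get_payload(decode=True)  # raw bytes, transfer-decoded
        if payload and looks_like_shift_jis(payload):
            # replace=True keeps the header in its original position; it is
            # only valid when a Content-Type header already exists.
            part.set_param("charset", charset, replace="Content-Type" in part)
            changed = True
    return changed


def fix_eml_file(path: Path) -> bool:
    """Rewrite one .eml file in place if a charset had to be added."""
    msg = message_from_bytes(path.read_bytes())
    changed = add_missing_charset(msg)
    if changed:
        path.write_bytes(msg.as_bytes())
    return changed
```

To process a whole folder, something like `for path in Path("mail").glob("*.eml"): fix_eml_file(path)` should work, where "mail" is a placeholder directory name. Since the files are overwritten and long header lines may be re-folded on output, it is probably safest to try this on copies first.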