python, encoding, chardet

Encoding detection in Python: use the chardet library or not?


I'm writing an app that takes massive amounts of text as input, which could be in any character encoding, and I want to save it all as UTF-8. I won't receive, and can't trust, the character encoding that comes declared with the data (if any).

For a while I have used Python's chardet library to detect the original character encoding (http://pypi.python.org/pypi/chardet), but I ran into problems lately when I noticed that it doesn't support Scandinavian encodings (for example ISO-8859-1). Apart from that, it takes a huge amount of time/CPU/memory to get a result: around 40 seconds for a 2 MB text file.
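
For reference, this is roughly how I call chardet (a minimal sketch; name.txt and name_utf8.txt are just example file names):

import chardet

# Read the raw bytes; chardet works on bytes, not decoded text.
with open("name.txt", "rb") as f:
    raw = f.read()

# detect() returns a dict such as {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
result = chardet.detect(raw)

# Decode with the guessed encoding (falling back to UTF-8 if the guess is None)
# and re-save everything as UTF-8.
text = raw.decode(result["encoding"] or "utf-8", errors="replace")
with open("name_utf8.txt", "w", encoding="utf-8") as out:
    out.write(text)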

I tried just using the standard Linux file command:

file -bi name.txt

With all my files so far it has given me a 100% correct result, in about 0.1 seconds for a 2 MB file, and it supports Scandinavian character encodings as well.
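
If I switch to file, I would call it from Python with something like this (a sketch, assuming a GNU file whose --brief and --mime-encoding flags are available; they print just the charset part of what -bi reports):

import subprocess

def detect_with_file(path):
    # Ask the system `file` command for the charset only,
    # e.g. "utf-8", "iso-8859-1" or "unknown-8bit".
    out = subprocess.run(
        ["file", "--brief", "--mime-encoding", path],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

encoding = detect_with_file("name.txt")
# Decode with the reported charset and write back out as UTF-8.
# (This raises LookupError if `file` reports something Python has no
# codec for, such as "unknown-8bit" or "binary".)
with open("name.txt", encoding=encoding, errors="replace") as f:
    text = f.read()
with open("name_utf8.txt", "w", encoding="utf-8") as out:
    out.write(text)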

So the advantages of using file seem clear. What are the downsides? Am I missing something?


Solution

  • Old MS-DOS and Windows-formatted files can be detected as unknown-8bit instead of ISO-8859-X, because their encodings are not completely standard. Chardet, by contrast, will make an educated guess and report a confidence value.

    http://www.faqs.org/faqs/internationalization/iso-8859-1-charset/

    If you won't be handling old, exotic, out-of-standard text files, I think you can use file -i without many problems; otherwise, a possible fallback approach is sketched below.
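
    One possible compromise, just as a sketch: try the fast file call first and fall back to chardet's educated guess only when file reports unknown-8bit or binary. The 0.5 confidence threshold and the iso-8859-1 last-resort default are arbitrary choices for illustration.

    import subprocess
    import chardet

    def guess_encoding(path):
        # Fast path: ask `file` for the charset.
        charset = subprocess.run(
            ["file", "--brief", "--mime-encoding", path],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        if charset not in ("unknown-8bit", "binary"):
            return charset

        # Slow path: let chardet make an educated guess and check its confidence.
        with open(path, "rb") as f:
            result = chardet.detect(f.read())
        if result["encoding"] and result["confidence"] > 0.5:
            return result["encoding"]
        return "iso-8859-1"  # last-resort default; pick whatever fits your data

    print(guess_encoding("name.txt"))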