I'm writing an app that takes massive amounts of text as input, which could be in any character encoding, and I want to save it all in UTF-8. I either won't receive, or can't trust, the character encoding declared with the data (if any).
For a while I have used Python's chardet library (http://pypi.python.org/pypi/chardet) to detect the original character encoding, but I ran into problems lately: it doesn't support Scandinavian encodings (for example iso-8859-1), and apart from that it takes a huge amount of time/CPU/memory to produce a result, roughly 40s for a 2MB text file.
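For reference, this is roughly how I use chardet today; a minimal sketch, with "name.txt" and "name.utf8.txt" as placeholder file names rather than anything from my actual app:

import chardet

with open("name.txt", "rb") as f:
    raw = f.read()

# chardet.detect() returns e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
result = chardet.detect(raw)
encoding = result["encoding"] or "utf-8"  # fall back if detection fails entirely

# Decode with the guessed encoding and write the text back out as UTF-8.
text = raw.decode(encoding, errors="replace")
with open("name.utf8.txt", "w", encoding="utf-8") as out:
    out.write(text)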
I tried just using the standard Linux file command:
file -bi name.txt
With all my files so far it has given me the correct result, and it does so in ~0.1s for a 2MB file. It supports Scandinavian character encodings as well.
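The equivalent from inside the app would be something like the sketch below; it shells out to file -bi, parses the reported charset, and re-encodes to UTF-8. detect_with_file() and to_utf8() are just illustrative helper names, not part of any library:

import subprocess

def detect_with_file(path):
    # `file -bi` prints something like "text/plain; charset=iso-8859-1"
    out = subprocess.run(["file", "-bi", path],
                         capture_output=True, text=True, check=True).stdout
    # Note: values like "binary" or "unknown-8bit" are not valid Python codecs
    # and would need extra handling in a real app.
    return out.rsplit("charset=", 1)[-1].strip()

def to_utf8(src, dst):
    encoding = detect_with_file(src)
    with open(src, "rb") as f:
        text = f.read().decode(encoding, errors="replace")
    with open(dst, "w", encoding="utf-8") as f:
        f.write(text)

to_utf8("name.txt", "name.utf8.txt")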
So the advantages of using file seem clear. What are the downsides? Am I missing something?
… chardet) using the same rules that file uses... - have you tried a file vs chardet comparison across a significant amount of test data? – Dutybound
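If it helps, a comparison like the one suggested could be as simple as the sketch below, which runs both detectors over a directory of test files and prints the disagreements. The "testdata" directory name is a placeholder:

import pathlib
import subprocess
import chardet

def file_guess(path):
    # Charset as reported by `file -bi`, e.g. "iso-8859-1"
    out = subprocess.run(["file", "-bi", str(path)],
                         capture_output=True, text=True).stdout
    return out.rsplit("charset=", 1)[-1].strip().lower()

def chardet_guess(path):
    result = chardet.detect(path.read_bytes())
    return (result["encoding"] or "unknown").lower()

for path in sorted(pathlib.Path("testdata").iterdir()):
    f_enc, c_enc = file_guess(path), chardet_guess(path)
    if f_enc != c_enc:
        print(f"{path.name}: file={f_enc} chardet={c_enc}")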