Encoding detection in Python: use the chardet library or not?

I'm writing an app that takes massive amounts of text as input, which could be in any character encoding, and I want to save it all as UTF-8. I won't receive, or can't trust, the character encoding declared with the data (if any).

For a while I have used Python's chardet library (http://pypi.python.org/pypi/chardet) to detect the original character encoding, but I recently ran into problems when I noticed that it doesn't support Scandinavian encodings (for example ISO-8859-1). Apart from that, it takes a huge amount of time/CPU/memory to get a result: roughly 40 s for a 2 MB text file.
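
For reference, a minimal sketch of the kind of detection code in question, using chardet's incremental UniversalDetector so the whole file isn't buffered at once (the helper name is made up; the file name mirrors the example below):

from chardet.universaldetector import UniversalDetector

def detect_encoding(path):
    # Feed the file to chardet a line at a time and return its guess.
    detector = UniversalDetector()
    with open(path, 'rb') as f:
        for line in f:
            detector.feed(line)
            if detector.done:   # stop early once chardet is confident
                break
    detector.close()
    return detector.result      # a dict with 'encoding' and 'confidence' keys

result = detect_encoding('name.txt')
with open('name.txt', 'rb') as f:
    text = f.read().decode(result['encoding'] or 'utf-8', errors='replace')
utf8_bytes = text.encode('utf-8')   # everything ends up as UTF-8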

I tried just using the standard Linux file command:

file -bi name.txt

With all my files so far it has given me a 100% correct result, and it does so in about 0.1 s for a 2 MB file. It supports Scandinavian character encodings as well.
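
If file turns out to be the tool of choice, one way to call it from a Python script is via subprocess; the snippet below is just a sketch and assumes the --brief/--mime-encoding flags of the common libmagic-based file:

import subprocess

def detect_with_file(path):
    # Equivalent to `file -b --mime-encoding path`; prints only the charset,
    # e.g. "utf-8" or "iso-8859-1".
    completed = subprocess.run(
        ['file', '--brief', '--mime-encoding', path],
        capture_output=True, text=True, check=True)
    return completed.stdout.strip()

print(detect_with_file('name.txt'))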

So I guess the advantages of using file are clear. What are the downsides? Am I missing something?

Foraminifer answered 27/11, 2012 at 19:51 Comment(4)
If it's 100% accurate, then I'm wondering why someone hasn't implemented it (or chardet) using the same rules that file uses... Have you tried a file vs chardet comparison across a significant amount of test data? – Dutybound
FWIW, ISO-8859-1 (and its revision, -15) is not just Scandinavian; it's used for many other Latin-based scripts. If the input is "mostly ASCII" and not UTF-8, ISO-8859-1 is a pretty good guess. en.wikipedia.org/wiki/ISO/IEC_8859#The_Parts_of_ISO.2FIEC_8859 – Synn
Jon, I totally agree. Hence my question. I don't have access to enough data to make such a comparison statistically significant, so the answer to your question is no, unfortunately. – Foraminifer
Thomas, yes, sorry, you're completely right. The issue I ran into involved Scandinavian languages, which is probably why I used that as the example. I agree it would probably be a good guess, but if there's a fast method that's more accurate, I would prefer to use it. – Foraminifer

Old MS-DOS and Windows formatted files can be detected as unknown-8bit instead of ISO-8859-X, because their encodings are not completely standard. Chardet, on the other hand, will make an educated guess and report a confidence value.

http://www.faqs.org/faqs/internationalization/iso-8859-1-charset/

If you won't be handling old, exotic, out-of-standard text files, I think you can use file -i without many problems.
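
One possible way to combine the two, sketched under the assumption that file prints "unknown-8bit" or "binary" in exactly these problematic cases: ask file first, and only fall back to chardet (which also yields a confidence value) when file gives up.

import subprocess
import chardet

def guess_encoding(path):
    # Fast path: ask file for the charset.
    charset = subprocess.run(
        ['file', '--brief', '--mime-encoding', path],
        capture_output=True, text=True, check=True).stdout.strip()
    if charset not in ('unknown-8bit', 'binary'):
        return charset, None    # file does not report a confidence value
    # Slow path: let chardet make an educated guess with a confidence value.
    with open(path, 'rb') as f:
        result = chardet.detect(f.read())
    return result['encoding'], result['confidence']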

Rika answered 29/11, 2012 at 11:54 Comment(2)
Thanks for your answer, makes sense. Do you have an example of such a file? Old MS-DOS or Windows formatted, I mean. – Foraminifer
This can be an example, I think. It's an old text file from an MS-DOS application, 1988. file -i on my Ubuntu 12.04 detects it as application/octet-stream; charset=binary. There's a wrong character somewhere. I'm not the MASTER ENCODER, but if you open it with Okteta you can see binary data (09 bytes) everywhere. If there's another explanation let me know, thank you. filebin.ca/OOQ4WVHhaKT – Rika

I have found "chared" (http://code.google.com/p/chared/) to be pretty accurate. You can even train new encoding detectors for languages that not supported.

It might be a good alternative when chardet starts acting up.

Elwina answered 20/2, 2013 at 17:33 Comment(2)
Cool, thanks. It seems to have one extra requirement, though: you have to know the language used in the text. Usually I don't know that in my app. But it definitely seems to be a good alternative. – Foraminifer
Yes, you need to know the language, but you could guess it using, for example, langid (github.com/saffsd/langid.py). – Elwina
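
A rough sketch of that language-guessing step, assuming langid.py's classify()/set_languages() interface; the language codes chosen here are just examples:

import langid

# Optionally restrict the candidate set to the languages you expect to see.
langid.set_languages(['en', 'sv', 'da', 'fi'])

lang, score = langid.classify('Det här är en svensk text.')
print(lang, score)   # e.g. ('sv', ...); the score is a model score, not a percentage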
