How Can I Best Guess the Encoding when the BOM (Byte Order Mark) is Missing?

My program has to read files that use various encodings. They may be ANSI, UTF-8 or UTF-16 (big or little endian).

When the BOM (Byte Order Mark) is there, I have no problem. I know if the file is UTF-8 or UTF-16 BE or LE.
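For reference, the BOM sniffing described here can be sketched in Python (a hypothetical helper for illustration only; the actual program is Delphi):

```python
def detect_bom(data: bytes):
    """Return (encoding, bom_length) if a known BOM starts the data, else (None, 0)."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8", 3
    # Check UTF-32 before UTF-16: the UTF-32 LE BOM begins with
    # the same two bytes as the UTF-16 LE BOM.
    if data.startswith(b"\xff\xfe\x00\x00"):
        return "utf-32-le", 4
    if data.startswith(b"\x00\x00\xfe\xff"):
        return "utf-32-be", 4
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le", 2
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be", 2
    return None, 0
```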

I wanted to assume that when there was no BOM, the file was ANSI. But I have found that the files I am dealing with are often missing their BOM. Therefore no BOM may mean that the file is ANSI, UTF-8, UTF-16 BE or UTF-16 LE.

When the file has no BOM, what would be the best way to scan some of the file and most accurately guess the type of encoding? I'd like to be right close to 100% of the time if the file is ANSI and in the high 90's if it is a UTF format.

I'm looking for a generic algorithmic way to determine this. But I actually use Delphi 2009, which understands Unicode and has a TEncoding class, so something specific to that would be a bonus.


Answer:

ShreevatsaR's answer led me to search Google for "universal encoding detector delphi", and I was surprised to see this post listed in the #1 position after it had been alive for only about 45 minutes! That is fast googlebotting! It is also amazing that Stack Overflow gets into 1st place so quickly.

The 2nd entry in Google was a blog entry by Fred Eaker on Character encoding detection that listed algorithms in various languages.

I found the mention of Delphi on that page, and it led me straight to the free, open-source ChsDet charset detector on SourceForge, written in Delphi and based on Mozilla's i18n component.

Fantastic! Thank you to all those who answered (all +1), thank you ShreevatsaR, and thank you again Stack Overflow, for helping me find my answer in less than an hour!

Hillier answered 16/12, 2008 at 22:59 Comment(1)
Thanks for the edit! ChsDet seems to be working! – Aftergrowth
Answer (score 9):

Maybe you can shell out to a Python script that uses Chardet: Universal Encoding Detector. It is a reimplementation of the character-encoding detection used by Firefox, and it is used by many different applications. Useful links: Mozilla's code, the research paper it was based on (ironically, my Firefox fails to correctly detect the encoding of that page), a short explanation, and a detailed explanation.

Horrified answered 16/12, 2008 at 22:59 Comment(4)
Ooooh. That's exactly the type of algorithm I'm looking for. Now if I could figure out how it works, or just find a Delphi equivalent ... – Hillier
According to the docs, it's a Python port of Mozilla's C++ code. The latter is located at mxr.mozilla.org/seamonkey/source/extensions/universalchardet/… No idea which incarnation is easier to port, though! – Nelly
(contd.) The C++ version seems to be more amply commented, which might help in porting. – Nelly
All the links have died. Can you try to restore them? – Unsparing
Answer (score 5):

Here is how Notepad does it.

There is also the Python Universal Encoding Detector, which you can check out.

Abagail answered 16/12, 2008 at 23:13 Comment(3)
The IsTextUnicode function is a good first step. Then it says it uses ietf.org/rfc/rfc2279.txt?number=2279 for the UTF-8 definition, but that doesn't say what to test. – Hillier
Actually, WP, it's en.wikipedia.org/wiki/Bush_hid_the_facts (some jokes do have to be explained). – Rootlet
Actually my version is "MS hid the facts" (without quotation marks of course). Try it. – Petromilli
Answer (score 4):

My guess is:

  • First, check if the file has byte values less than 32 (except for tabs and newlines). If it does, it can't be ANSI or UTF-8, so it is probably UTF-16; you just have to figure out the endianness. For this you should probably use a table of valid Unicode character codes. If you encounter invalid codes, try the other endianness and see if that fits. If both fit (or neither does), check which one has the larger percentage of alphanumeric codes. You might also try searching for line breaks and determining the endianness from them. Other than that, I have no ideas for how to check endianness.
  • If the file contains no values less than 32 (apart from said whitespace), it's probably ANSI or UTF-8. Try parsing it as UTF-8 and see if you get any invalid Unicode characters. If you do, it's probably ANSI.
  • If you expect documents in non-English single-byte or multi-byte non-Unicode encodings, then you're out of luck. The best you can do is something like Internet Explorer, which makes a histogram of character values and compares it to histograms of known languages. It works fairly often, but sometimes fails, and you'll need a large library of letter histograms for every language.
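The steps above can be sketched in Python (a rough illustration with made-up helper names and scoring; not the Delphi code the asker would actually use, and it ignores the non-Unicode-histogram case entirely):

```python
def guess_encoding(data: bytes) -> str:
    """Rough heuristic from the steps above: stray control bytes hint at
    UTF-16, otherwise try UTF-8 and fall back to ANSI."""
    allowed = {9, 10, 13}  # tab, LF, CR
    has_control = any(b < 32 and b not in allowed for b in data)

    if has_control:
        # Likely UTF-16: guess endianness by decoding both ways and
        # counting characters that look like text (alphanumerics, spaces).
        def score(enc: str) -> int:
            try:
                text = data.decode(enc)
            except UnicodeDecodeError:
                return -1
            return sum(ch.isalnum() or ch.isspace() for ch in text)
        return max(("utf-16-le", "utf-16-be"), key=score)

    # No stray control bytes: UTF-8 if it parses cleanly, otherwise ANSI.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "ansi"
```

Note the "ANSI" verdict is only a fallback: pure 7-bit ASCII data also decodes as UTF-8, which is harmless since the two agree on those bytes.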
Property answered 16/12, 2008 at 23:11 Comment(3)
Hmmm, I often see bytes with values less than 32 in my text files. Things like \n, \r and \t. Rarely some other ones, too. – Subrogate
ASCII, most ANSI code pages, and UTF-8 understand characters such as carriage return, line feed, horizontal tab, null character, etc., which have byte values less than 32. – Petromilli
I meant to say ANSI, not ASCII, in the question. I've modified the question now. You might want to modify your answer to reflect this. – Hillier
Answer (score 1):

ASCII? No modern OS uses ASCII any more. They all use 8 bit codes, at least, meaning it's either UTF-8, ISOLatinX, WinLatinX, MacRoman, Shift-JIS or whatever else is out there.

The only test I know of is to check for invalid UTF-8 sequences. If you find any, then you know it can't be UTF-8. The same is probably possible for UTF-16. But if it's not a Unicode encoding, it will be hard to tell which Windows code page it might be.

Most editors I know deal with this by letting the user choose a default from the list of all possible encodings.

There is code out there for checking validity of UTF chars.
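For example, a minimal hand-rolled UTF-8 validity check might look like the following sketch (simplified: it checks lead and continuation bytes only, and does not reject overlong three-/four-byte forms or surrogate code points the way a full validator would):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Check that every UTF-8 lead byte is followed by the right number
    of continuation bytes (0x80-0xBF)."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                 # single-byte ASCII
            n = 0
        elif 0xC2 <= b <= 0xDF:      # 2-byte lead (0xC0/0xC1 would be overlong)
            n = 1
        elif 0xE0 <= b <= 0xEF:      # 3-byte lead
            n = 2
        elif 0xF0 <= b <= 0xF4:      # 4-byte lead
            n = 3
        else:                        # stray continuation byte or invalid lead
            return False
        for j in range(i + 1, i + 1 + n):
            if j >= len(data) or not 0x80 <= data[j] <= 0xBF:
                return False
        i += n + 1
    return True
```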

Collateral answered 16/12, 2008 at 23:10 Comment(3)
Windows still has device drivers. If your kernel code isn't 7-bit clean you'll regret it. – Petromilli
@Windows programmer: what do you mean kernel code needs to be 7-bit clean? Most (all?) drivers need to deal with Unicode - although sometimes the problem is correctly converting from MBCS to Unicode (do I use OEM or the default code page?, etc.). – Subrogate
OK, code that handles filenames has to copy and convert character strings in variables (PUNICODE etc.), but the source code still has to be 7-bit clean in order to compile properly at compile time. – Petromilli

© 2022 - 2024 — McMap. All rights reserved.