Detecting 'text' file type (ANSI vs UTF-8)
I wrote an application (a psychological testing exam) in Delphi (7) which creates a standard text file, i.e. the file is ANSI-encoded.

Someone has ported the program to run on the Internet, probably using Java, and the resulting text file is of type UTF-8.

The program which reads these results files will have to read both the files created by Delphi and the files created via the Internet.

Whilst I can convert the UTF-8 text to ANSI (using the cunningly named function UTF8ToANSI), how can I tell in advance which kind of file I have?

Seeing as I 'own' the file format, I suppose the easiest way to deal with this would be to place a marker within the file at a known position which will tell me the source of the program (Delphi/Internet), but this seems to be cheating.

Thanks in advance.

Lubalubba answered 5/2, 2011 at 16:11 Comment(6)
Putting a marker indicating the encoding is not cheating, it's fairly standard (XML does it). The question is rather if converting your old files is a problem.Udine
if you own the file then just put a bom in and it's all goodThenceforth
Make your own format use UTF-8 for new files too. Using a locale dependent charset leads to many horrors.Nopar
A BOM can mess up applications, I would never add one to a UTF-8 encoded file - unless I am forced to :)Contravene
A textfile can be both ANSI and UTF8 if it sticks to the ASCII subsetAmourpropre
It seems that the Internet file does have a BOM so I'm going to check for this first before using the UTF8ToANSI function. Thanks to all.Psychodrama
C
2

If the UTF file begins with the UTF-8 Byte-Order Mark (BOM), this is easy:

function UTF8FileBOM(const FileName: string): boolean;
var
  txt: file;
  bytes: array[0..2] of byte;
  amt: integer;
begin

  FileMode := fmOpenRead;
  AssignFile(txt, FileName);
  Reset(txt, 1);

  try
    BlockRead(txt, bytes, 3, amt);
    result := (amt=3) and (bytes[0] = $EF) and (bytes[1] = $BB) and (bytes[2] = $BF);
  finally    
    CloseFile(txt);
  end;

end;

Otherwise, it is much more difficult.
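For readers outside Delphi, the same BOM check can be sketched in Python (the function name `utf8_file_bom` is my own; it mirrors the Delphi routine above by reading the first three bytes and comparing them against EF BB BF):

```python
def utf8_file_bom(filename):
    """Return True if the file starts with the UTF-8 BOM (EF BB BF)."""
    with open(filename, "rb") as f:
        return f.read(3) == b"\xef\xbb\xbf"
```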

Corena answered 5/2, 2011 at 16:15 Comment(15)
What if the BOM has a valid interpretation in the "ANSI" character set?Udine
Finding a BOM on UTF-8 data is pretty rare, as UTF-8 is endianness-agnostic and hence doesn't require a BOM to determine byte order.Decennium
@larsmans "ANSI" is usually just an alias for "Windows-1252". So yes, the BOM does have a valid interpretation in "ANSI"...Decennium
@dkarp: Yes, and we all know how the BOM looks when interpreted in 1252.Corena
@Decennium your points are all correct but seem irrelevant to this questionThenceforth
@Andreas Oh, yeah. But still, -1 to this answer. You really can't count on having a BOM in UTF-8 data. A good answer would try to test if the data is valid UTF-8...Decennium
@David: This answer basically says "Look for the BOM." (And that's all the code does.) Except that 9 times out of 10, a UTF-8 file doesn't have a BOM since it doesn't need a BOM...Decennium
@dkarp: Well, I did write " If the UTF file begins with the UTF-8 Byte-Order Mark (BOM), this is easy:". Thus, I gave a sufficient (well) condition, but not a necessary one. (I even ended my answer with "Otherwise, it is much more difficult.".) I was unaware of any rule saying that an answer at SO had to be complete in order to be useful...Corena
@Decennium the word ANSI as Microsoft means the local legacy charset, and can differ from system to system depending on the OS language.Nopar
@CodeInChaos That makes a lot more sense, actually. Thanks!Decennium
I must be really thick, for I cannot see why anyone even moderately sober would downvote this...Corena
@Andreas While I don't like relying on BOM, I don't think this deserves a downvote. But it looks like somebody downvoted all answers to this thread.Nopar
@Andreas I'm one of the 2 downvotes, and I thought I explained why. 90+% of the time, your answer simply isn't helpful, as UTF-8 files very rarely have a BOM. It's kind of like answering "How do I replicate MySQL's utf8_unicode_ci in Java?" by saying "Well, if both strings are empty, you return 0. Otherwise, it is much more difficult." Yes, that's true. But not helpful.Decennium
I'm accepting this answer as the file created by the Internet version of the program does indeed have a BOM - its first three characters are EF BB BF. I'm going to ask the person who created the Internet version to create some more files so I can check this more thoroughly. Thanks to all who participated.Psychodrama
+1, as Andreas states the answer is correct and the conditions are stated. No reason at all to downvote it!Darned
T
22

There is no 100% sure way to recognize ANSI (e.g. Windows-1250) encoding from UTF-8 encoding. There are ANSI files which cannot be valid UTF-8, but every valid UTF-8 file might as well be a different ANSI file. (Not to mention ASCII-only data, which are both ANSI and UTF-8 by definition, but that is purely a theoretical aspect.)

For instance, the sequence C4 8D might be the “č” character in UTF-8, or it might be “ÄŤ” in windows-1250. Both are possible and correct. However, e.g. 8D 9A can be “Ťš” in windows-1250, but it is not a valid UTF-8 string.
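The ambiguity is easy to demonstrate; in this Python sketch the same two bytes decode successfully under both codecs, while 8D 9A fails UTF-8 validation:

```python
ambiguous = b"\xc4\x8d"                   # valid under both encodings
print(ambiguous.decode("utf-8"))          # č
print(ambiguous.decode("windows-1250"))   # ÄŤ

invalid = b"\x8d\x9a"                     # "Ťš" in windows-1250 ...
print(invalid.decode("windows-1250"))     # Ťš
try:
    invalid.decode("utf-8")               # ... but not a valid UTF-8 string
except UnicodeDecodeError:
    print("not valid UTF-8")
```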

You have to resort to some kind of heuristic, e.g.

  1. If the file contains a sequence which cannot be a valid UTF-8, assume it is ANSI.
  2. Otherwise, if the file begins with UTF-8 BOM (EF BB BF), assume it is UTF-8 (it might not be, however, plain text ANSI file beginning with such characters is very improbable).
  3. Otherwise, assume it is UTF-8. (Or, try more heuristics, maybe using the knowledge of the language of the text, etc.)
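The three steps above can be sketched as follows (Python for brevity; `guess_encoding` is my own name, and step 3 here simply defaults to UTF-8 without the extra language-based heuristics):

```python
def guess_encoding(data: bytes) -> str:
    # 1. A sequence that is not valid UTF-8 must be ANSI.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "ansi"
    # 2. A UTF-8 BOM makes UTF-8 near-certain.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    # 3. Otherwise, assume UTF-8 (further heuristics could refine this).
    return "utf-8"
```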

See also the method used by Notepad.

Transubstantiation answered 5/2, 2011 at 16:28 Comment(2)
+1, Though I would exclude UTF-16LE and UTF-16BE based on their byte order marks and optionally on zero/non-zero alternating byte occurences, BEFORE deciding on UTF-8...Nicholasnichole
@Marjan - Well, the specific question here was about distinguishing ANSI from UTF-8; UTF-16 is not expected at all. But in the generic case, you are right, there are many more questions to ask. (And the IsTextUnicode method mentioned in the link would help with that UTF-16 case.)Transubstantiation
F
1

If we summarize, then:

  • The best solution for basic usage is the (outdated) IsTextUnicode() function;
  • The best solution for advanced usage is the function below, then checking the BOM ( ~ 1KB ), then checking the locale info under the particular OS, and only then do you get about 98% accuracy?

OTHER INFO PEOPLE MAY FIND INTERESTING:

https://groups.google.com/forum/?lnk=st&q=delphi+WIN32+functions+to+detect+which+encoding++is+in+use&rnum=1&hl=pt-BR&pli=1#!topic/borland.public.delphi.internationalization.win32/_LgLolX25OA

function FileMayBeUTF8(FileName: WideString): Boolean;
var
 Stream: TMemoryStream;
 BytesRead: integer;
 ArrayBuff: array[0..127] of byte;
 PreviousByte: byte;
 i: integer;
 YesSequences, NoSequences: integer;

begin
   Result := False;
   if not WideFileExists(FileName) then
     Exit;
   YesSequences := 0;
   NoSequences := 0;
   Stream := TMemoryStream.Create;
   try
     Stream.LoadFromFile(FileName);
     repeat

     {read from the TMemoryStream}

       BytesRead := Stream.Read(ArrayBuff, High(ArrayBuff) + 1);
           {Do the work on the bytes in the buffer}
       if BytesRead > 1 then
         begin
           for i := 1 to BytesRead-1 do
             begin
               PreviousByte := ArrayBuff[i-1];
               if ((ArrayBuff[i] and $c0) = $80) then
                 begin
                   if ((PreviousByte and $c0) = $c0) then
                     begin
                       inc(YesSequences)
                     end
                   else
                     begin
                       if ((PreviousByte and $80) = $0) then
                         inc(NoSequences);
                     end;
                 end;
             end;
         end;
     until (BytesRead < (High(ArrayBuff) + 1));
//Below, >= makes ASCII files = UTF-8, which is no problem.
//Simple > would catch only UTF-8;
     Result := (YesSequences >= NoSequences);

   finally
     Stream.Free;
   end;
end;
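The counting idea in the function above can be sketched in Python (a rough port of mine, not the original): count continuation bytes (10xxxxxx) that follow a lead byte (11xxxxxx) as "yes" sequences, and those that follow a plain ASCII byte as "no" sequences:

```python
def file_may_be_utf8(data: bytes) -> bool:
    yes = no = 0
    for prev, cur in zip(data, data[1:]):
        if cur & 0xC0 == 0x80:        # continuation byte 10xxxxxx
            if prev & 0xC0 == 0xC0:   # preceded by a lead byte 11xxxxxx
                yes += 1
            elif prev & 0x80 == 0:    # preceded by a plain ASCII byte
                no += 1
    # >= makes pure-ASCII data count as UTF-8, which is harmless
    return yes >= no
```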

Now testing this function...

In my humble opinion, the only way to START doing this check correctly is to check the OS charset in the first place, because in the end, in almost all cases, some reference is made to the OS. There is no way to escape it anyway...

Fred answered 25/2, 2011 at 9:34 Comment(2)
If you combine the marked answer and this one into one procedure, then this would be pretty accurate and fast.Forta
AFAICS this function (somewhat) counts valid 2-byte code points and an specific case of broken code points (a surrogate-looking byte preceded by an ASCII char). Its main problem is the reliance on matching surrogate-looking bytes, that makes it return True for many/most ISO-8859-1 files, since the characters in $80 to $c0 range are not that frequently used (except maybe for the inverted-? symbol for Spanish).Ferine
N
0

When reading first try parsing the file as UTF-8. If it isn't valid UTF-8 interpret the file as the legacy encoding(ANSI). This will work on most files, since it's very unlikely that a legacy encoded file will be valid UTF-8.
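This try-first strategy is trivial to sketch (Python here for brevity; in Delphi 7 you would run a validity check and then Utf8ToAnsi). The cp1252 fallback is an assumption, standing in for whatever the system's legacy codepage is:

```python
def read_text(data: bytes) -> str:
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("cp1252")  # legacy "ANSI" codepage, assumed here
```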

What Windows calls ANSI is a system-locale-dependent charset, and the text won't work correctly on a Russian, or Asian, or... Windows.

While the VCL doesn't support Unicode in Delphi 7, you should still work with Unicode internally and only convert to ANSI for display. I localized one of my programs to Korean and Russian, and that was the only way I got it working without large problems. You could still only display the Korean localization on a system set to Korean, but at least the text files could be edited on any system.

Nopar answered 5/2, 2011 at 16:56 Comment(0)
K
0
// If the text can be decoded, then it is UTF-8

function isFileUTF8(const Tex : AnsiString): boolean;
begin
  result := (Tex <> '') and (UTF8Decode(Tex) <> '');
end;
Kalamazoo answered 30/10, 2019 at 0:12 Comment(1)
What does isFileUTF8('abcde') return?Teddy
V
0

As others said, there is no perfect way. You have to use heuristics. Here is a method I use which provides good results, assuming you already know the ANSI charset (e.g. ISO-8859-1 or Windows-1252):

  1. Check if there is a BOM header. If yes, it's UTF-8.
  2. Check if there is any byte higher than 0x80 (except 0xA0, which is NBSP). If there isn't any, it's plain ASCII (valid as both).
  3. Open the file as UTF-8. Check if all characters are within the charset (e.g. ISO-8859-1). If not, it's probably ANSI, not UTF-8 (e.g. if you got 汉, since it's not part of ISO-8859-1, it's probably ANSI).

If you don't know the charset in advance: follow steps 1 and 2. For step 3: open the file as ANSI with different charsets (and as UTF-8). For each result, perform the tests and calculate a score/confidence. Take the one that fits best. This is how Notepad++ tries to detect text encoding. See here and here.

Vigil answered 5/5, 2022 at 18:38 Comment(0)
G
0

This function does basic parsing of multi-byte UTF-8 code points. For a more thorough implementation you may have a look at the C implementations from this question: How to detect UTF-8 in plain C?

function FileIsValidUTF8(FileName: string): boolean;
var
  Stream: TStream;
  CurrByte: byte;
  SurrogateCount: integer;
  i: integer;
begin
  if not FileExists(FileName) then
    Exit(False);
  Stream := TFileStream.Create(FileName, fmOpenRead);
  try
    if Stream.Size = 0 then
      Exit(True);

    repeat
      CurrByte := Stream.ReadByte();

      // ascii
      if CurrByte <= 127 then
        continue;

      // leading code units
      if (CurrByte and $e0 = $c0) then
        SurrogateCount := 1
      else if (CurrByte and $f0 = $e0) then
        SurrogateCount := 2
      else if (CurrByte and $f8 = $f0) then
        SurrogateCount := 3
      else
        Exit(False); // invalid code unit

      if Stream.Position + SurrogateCount > Stream.Size then
        Exit(False); // incomplete code point

      for i := 1 to SurrogateCount do
      begin
        // invalid continuation byte
        if Stream.ReadByte() and $c0 <> $80 then
          Exit(False);
      end;
    until Stream.Position >= Stream.Size;
  finally
    Stream.Free;
  end;
  Result := True;
end;
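A compact Python equivalent of the same lead-byte/continuation-byte state machine (my own sketch for cross-checking, not the answer's code; like the Delphi version it ignores overlong forms and surrogate ranges):

```python
def is_valid_utf8(data: bytes) -> bool:
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:                  # ASCII byte
            i += 1
            continue
        if b & 0xE0 == 0xC0:           # 2-byte lead 110xxxxx
            k = 1
        elif b & 0xF0 == 0xE0:         # 3-byte lead 1110xxxx
            k = 2
        elif b & 0xF8 == 0xF0:         # 4-byte lead 11110xxx
            k = 3
        else:
            return False               # stray continuation or invalid byte
        if i + k >= n:
            return False               # truncated sequence at end of data
        if any(data[i + j] & 0xC0 != 0x80 for j in range(1, k + 1)):
            return False               # bad continuation byte
        i += k + 1
    return True
```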
Gustavogustavus answered 10/4 at 8:42 Comment(0)
P
0

I have a utility that works on Windows but not on Linux. It's not perfect, but it covers more cases.

function DetectEncoding(const Bytes: TBytes): TEncoding;
const
  BOM_UTF8: array [0 .. 2] of Byte = ($EF, $BB, $BF);
  BOM_UTF16_LE: array [0 .. 1] of Byte = ($FF, $FE);
  BOM_UTF16_BE: array [0 .. 1] of Byte = ($FE, $FF);
  // Add more BOMs here if needed
begin
  Result := TEncoding.Default; // Default to ANSI if no BOM is found

  if Length(Bytes) >= 3 then
  begin
    if CompareMem(@Bytes[0], @BOM_UTF8[0], Length(BOM_UTF8)) then
      Exit(TEncoding.UTF8)

    else if CompareMem(@Bytes[0], @BOM_UTF16_LE[0], Length(BOM_UTF16_LE)) then
      Exit(TEncoding.Unicode) // UTF-16 LE

    else if CompareMem(@Bytes[0], @BOM_UTF16_BE[0], Length(BOM_UTF16_BE)) then
      Exit(TEncoding.BigEndianUnicode); // UTF-16 BE
    // Add more checks here if needed
  end
  else if Length(Bytes) >= 2 then
  begin
    if CompareMem(@Bytes[0], @BOM_UTF16_LE[0], Length(BOM_UTF16_LE)) then
      Exit(TEncoding.Unicode) // UTF-16 LE

    else if CompareMem(@Bytes[0], @BOM_UTF16_BE[0], Length(BOM_UTF16_BE)) then
      Exit(TEncoding.BigEndianUnicode); // UTF-16 BE

    // Add more checks here if needed
  end;
end;
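The same BOM table can be sketched compactly like this (codec names here are Python's, e.g. "utf-16-le", rather than Delphi TEncoding values, and "ansi" stands in for TEncoding.Default):

```python
BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
    # Add more BOMs here if needed
]

def detect_encoding(data: bytes) -> str:
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return "ansi"  # default when no BOM is found
```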
Parallelogram answered 10/4 at 13:47 Comment(0)
M
-1

Forget BOM and other advice. Here's what I found and keep for reference:

Valid UTF-8 has a specific binary format. If it's a single-byte UTF-8 character, then it is always of the form '0xxxxxxx', where 'x' is any binary digit. If it's a two-byte UTF-8 character, then it's always of the form '110xxxxx 10xxxxxx'.

Source.

By the way, you're mostly on your own. The knowledge of codepages, UTF etc. isn't that good in the West, so the quality of advice is similarly... questionable.

Mores answered 5/2, 2022 at 18:10 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.