Detecting 'text' file type (ANSI vs UTF-8)
I wrote an application (a psychological testing exam) in Delphi (7) which creates a standard text file, i.e. the file is ANSI-encoded.

Someone has ported the program to run on the Internet, probably using Java, and the resulting text file is of type UTF-8.

The program which reads these results files will have to read both the files created by Delphi and the files created via the Internet.

Whilst I can convert the UTF-8 text to ANSI (using the cunningly named function UTF8ToANSI), how can I tell in advance which kind of file I have?

Seeing as I 'own' the file format, I suppose the easiest way to deal with this would be to place a marker within the file at a known position which will tell me the source of the program (Delphi/Internet), but this seems to be cheating.

Thanks in advance.

Lubalubba answered 5/2, 2011 at 16:11 Comment(6)
Putting a marker indicating the encoding is not cheating, it's fairly standard (XML does it). The question is rather if converting your old files is a problem.Udine
if you own the file then just put a bom in and it's all goodThenceforth
Make your own format use UTF-8 for new files too. Using a locale dependent charset leads to many horrors.Nopar
A BOM can mess up applications, I would never add one to a UTF-8 encoded file - unless I am forced to :)Contravene
A textfile can be both ANSI and UTF8 if it sticks to the ASCII subsetAmourpropre
It seems that the Internet file does have a BOM so I'm going to check for this first before using the UTF8ToANSI function. Thanks to all.Psychodrama
C
2

If the UTF file begins with the UTF-8 Byte-Order Mark (BOM), this is easy:

function UTF8FileBOM(const FileName: string): boolean;
var
  txt: file;
  bytes: array[0..2] of byte;
  amt: integer;
begin

  FileMode := fmOpenRead;
  AssignFile(txt, FileName);
  Reset(txt, 1);

  try
    BlockRead(txt, bytes, 3, amt);
    result := (amt=3) and (bytes[0] = $EF) and (bytes[1] = $BB) and (bytes[2] = $BF);
  finally    
    CloseFile(txt);
  end;

end;

Otherwise, it is much more difficult.
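For readers outside Delphi, the same BOM check can be sketched in Python (the function name `utf8_file_bom` is my own; it mirrors the Delphi routine above by reading the first three bytes and comparing them against EF BB BF):

```python
def utf8_file_bom(filename):
    """Return True if the file starts with the UTF-8 BOM (EF BB BF)."""
    with open(filename, "rb") as f:
        return f.read(3) == b"\xef\xbb\xbf"
```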

Corena answered 5/2, 2011 at 16:15 Comment(15)
What if the BOM has a valid interpretation in the "ANSI" character set?Udine
Finding a BOM on UTF-8 data is pretty rare, as UTF-8 is endianness-agnostic and hence doesn't require a BOM to determine byte order.Decennium
@larsmans "ANSI" is usually just an alias for "Windows-1252". So yes, the BOM does have a valid interpretation in "ANSI"...Decennium
@dkarp: Yes, and we all know how the BOM looks when interpreted in 1252.Corena
@Decennium your points are all correct but seem irrelevant to this questionThenceforth
@Andreas Oh, yeah. But still, -1 to this answer. You really can't count on having a BOM in UTF-8 data. A good answer would try to test if the data is valid UTF-8...Decennium
@David: This answer basically says "Look for the BOM." (And that's all the code does.) Except that 9 times out of 10, a UTF-8 file doesn't have a BOM since it doesn't need a BOM...Decennium
@dkarp: Well, I did write " If the UTF file begins with the UTF-8 Byte-Order Mark (BOM), this is easy:". Thus, I gave a sufficient (well) condition, but not a necessary one. (I even ended my answer with "Otherwise, it is much more difficult.".) I was unaware of any rule saying that an answer at SO had to be complete in order to be useful...Corena
@Decennium the word ANSI as Microsoft means the local legacy charset, and can differ from system to system depending on the OS language.Nopar
@CodeInChaos That makes a lot more sense, actually. Thanks!Decennium
I must be really thick, for I cannot see why anyone even moderately sober would downvote this...Corena
@Andreas While I don't like relying on BOM, I don't think this deserves a downvote. But it looks like somebody downvoted all answers to this thread.Nopar
@Andreas I'm one of the 2 downvotes, and I thought I explained why. 90+% of the time, your answer simply isn't helpful, as UTF-8 files very rarely have a BOM. It's kind of like answering "How do I replicate MySQL's utf8_unicode_ci in Java?" by saying "Well, if both strings are empty, you return 0. Otherwise, it is much more difficult." Yes, that's true. But not helpful.Decennium
I'm accepting this answer as the file created by the Internet version of the program does indeed have a BOM - its first three characters are EF BB BF. I'm going to ask the person who created the Internet version to create some more files so I can check this more thoroughly. Thanks to all who participated.Psychodrama
+1, as Andreas states the answer is correct and the conditions are stated. No reason at all to downvote it!Darned
T
22

There is no 100% sure way to recognize ANSI (e.g. Windows-1250) encoding from UTF-8 encoding. There are ANSI files which cannot be valid UTF-8, but every valid UTF-8 file might as well be a different ANSI file. (Not to mention ASCII-only data, which are both ANSI and UTF-8 by definition, but that is purely a theoretical aspect.)

For instance, the sequence C4 8D might be the “č” character in UTF-8, or it might be “ÄŤ” in windows-1250. Both are possible and correct. However, e.g. 8D 9A can be “Ťš” in windows-1250, but it is not a valid UTF-8 string.
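The ambiguity is easy to demonstrate; in this Python sketch the same two bytes decode successfully under both codecs, while 8D 9A fails UTF-8 validation:

```python
ambiguous = b"\xc4\x8d"                   # valid under both encodings
print(ambiguous.decode("utf-8"))          # č
print(ambiguous.decode("windows-1250"))   # ÄŤ

invalid = b"\x8d\x9a"                     # "Ťš" in windows-1250 ...
print(invalid.decode("windows-1250"))     # Ťš
try:
    invalid.decode("utf-8")               # ... but not a valid UTF-8 string
except UnicodeDecodeError:
    print("not valid UTF-8")
```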

You have to resort to some kind of heuristic, e.g.

  1. If the file contains a sequence which cannot be a valid UTF-8, assume it is ANSI.
  2. Otherwise, if the file begins with UTF-8 BOM (EF BB BF), assume it is UTF-8 (it might not be, however, plain text ANSI file beginning with such characters is very improbable).
  3. Otherwise, assume it is UTF-8. (Or, try more heuristics, maybe using the knowledge of the language of the text, etc.)
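The three steps above can be sketched as follows (Python for brevity; `guess_encoding` is my own name, and step 3 here simply defaults to UTF-8 without the extra language-based heuristics):

```python
def guess_encoding(data: bytes) -> str:
    # 1. A sequence that is not valid UTF-8 must be ANSI.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "ansi"
    # 2. A UTF-8 BOM makes UTF-8 near-certain.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    # 3. Otherwise, assume UTF-8 (further heuristics could refine this).
    return "utf-8"
```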

See also the method used by Notepad.

Transubstantiation answered 5/2, 2011 at 16:28 Comment(2)
+1, Though I would exclude UTF-16LE and UTF-16BE based on their byte order marks and optionally on zero/non-zero alternating byte occurences, BEFORE deciding on UTF-8...Nicholasnichole
@Marjan - Well, the specific question here was about distinguishing ANSI from UTF-8; UTF-16 is not expected at all. But in the generic case, you are right, there are many more questions to ask. (And the IsTextUnicode method mentioned in the link would help with that UTF-16 case.)Transubstantiation
F
1

If we summarize, then:

  • The best solution for basic usage is the (outdated) IsTextUnicode() function;
  • The best solution for advanced usage is the function below, then checking the BOM ( ~ 1KB ), then checking the locale info under the particular OS, and only then do you get about 98% accuracy?

OTHER INFO PEOPLE MAY FIND INTERESTING:

https://groups.google.com/forum/?lnk=st&q=delphi+WIN32+functions+to+detect+which+encoding++is+in+use&rnum=1&hl=pt-BR&pli=1#!topic/borland.public.delphi.internationalization.win32/_LgLolX25OA

function FileMayBeUTF8(FileName: WideString): Boolean;
var
 Stream: TMemoryStream;
 BytesRead: integer;
 ArrayBuff: array[0..127] of byte;
 PreviousByte: byte;
 i: integer;
 YesSequences, NoSequences: integer;

begin
   Result := False;
   if not WideFileExists(FileName) then
     Exit;
   YesSequences := 0;
   NoSequences := 0;
   Stream := TMemoryStream.Create;
   try
     Stream.LoadFromFile(FileName);
     repeat

     {read from the TMemoryStream}

       BytesRead := Stream.Read(ArrayBuff, High(ArrayBuff) + 1);
           {Do the work on the bytes in the buffer}
       if BytesRead > 1 then
         begin
           for i := 1 to BytesRead-1 do
             begin
               PreviousByte := ArrayBuff[i-1];
               if ((ArrayBuff[i] and $c0) = $80) then
                 begin
                   if ((PreviousByte and $c0) = $c0) then
                     begin
                       inc(YesSequences)
                     end
                   else
                     begin
                       if ((PreviousByte and $80) = $0) then
                         inc(NoSequences);
                     end;
                 end;
             end;
         end;
     until (BytesRead < (High(ArrayBuff) + 1));
//Below, >= makes ASCII files = UTF-8, which is no problem.
//Simple > would catch only UTF-8;
     Result := (YesSequences >= NoSequences);

   finally
     Stream.Free;
   end;
end;
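The counting idea in the function above can be sketched in Python (a rough port of mine, not the original): count continuation bytes (10xxxxxx) that follow a lead byte (11xxxxxx) as "yes" sequences, and those that follow a plain ASCII byte as "no" sequences:

```python
def file_may_be_utf8(data: bytes) -> bool:
    yes = no = 0
    for prev, cur in zip(data, data[1:]):
        if cur & 0xC0 == 0x80:        # continuation byte 10xxxxxx
            if prev & 0xC0 == 0xC0:   # preceded by a lead byte 11xxxxxx
                yes += 1
            elif prev & 0x80 == 0:    # preceded by a plain ASCII byte
                no += 1
    # >= makes pure-ASCII data count as UTF-8, which is harmless
    return yes >= no
```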

Now testing this function...

In my humble opinion, the only way to START doing this check correctly is to check the OS charset in the first place, because in the end, in almost all cases, some reference is made to the OS. There is no way to escape it anyway...

Fred answered 25/2, 2011 at 9:34 Comment(2)
If you combine the marked answer and this one into one procedure, then this would be pretty accurate and fast.Forta
AFAICS this function (somewhat) counts valid 2-byte code points and an specific case of broken code points (a surrogate-looking byte preceded by an ASCII char). Its main problem is the reliance on matching surrogate-looking bytes, that makes it return True for many/most ISO-8859-1 files, since the characters in $80 to $c0 range are not that frequently used (except maybe for the inverted-? symbol for Spanish).Ferine
N
0

When reading first try parsing the file as UTF-8. If it isn't valid UTF-8 interpret the file as the legacy encoding(ANSI). This will work on most files, since it's very unlikely that a legacy encoded file will be valid UTF-8.
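This try-first strategy is trivial to sketch (Python here for brevity; in Delphi 7 you would run a validity check and then Utf8ToAnsi). The cp1252 fallback is an assumption, standing in for whatever the system's legacy codepage is:

```python
def read_text(data: bytes) -> str:
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("cp1252")  # legacy "ANSI" codepage, assumed here
```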

What Windows calls ANSI is a system-locale-dependent charset, and the text won't work correctly on a Russian, or Asian, or... Windows.

While the VCL doesn't support Unicode in Delphi 7, you should still work with Unicode internally and only convert to ANSI for display. I localized one of my programs to Korean and Russian, and that was the only way I got it working without large problems. You could still only display the Korean localization on a system set to Korean, but at least the text files could be edited on any system.

Nopar answered 5/2, 2011 at 16:56 Comment(0)
K
0
// If the text can be decoded, then it is UTF-8

function isFileUTF8(const Tex : AnsiString): boolean;
begin
  result := (Tex <> '') and (UTF8Decode(Tex) <> '');
end;
Kalamazoo answered 30/10, 2019 at 0:12 Comment(1)
What does isFileUTF8('abcde') return?Teddy
V
0

As others said, there is no perfect way. You have to use heuristics. Here is a method I use which provides good results, assuming you already know the ANSI charset (e.g. ISO-8859-1 or Windows-1252):

  1. Check if there is a BOM header. If yes, it's UTF-8.
  2. Check if there is any byte higher than 0x80 (except 0xA0, which is NBSP). If there isn't any, it's plain ASCII (valid as both).
  3. Open the file as UTF-8. Check if all characters are within the charset (e.g. ISO-8859-1). If not, it's probably ANSI, not UTF-8 (e.g. if you got 汉, since it's not part of ISO-8859-1, it's probably ANSI).

If you don't know the charset in advance: follow steps 1 and 2. For step 3: open the file as ANSI with different charsets (and as UTF-8). For each result, perform the tests and calculate a score/confidence. Take the one that fits best. This is how Notepad++ tries to detect text encoding. See here and here.

Vigil answered 5/5, 2022 at 18:38 Comment(0)
G
0

This function does basic parsing of multi-byte UTF-8 code points. For a more thorough implementation you may have a look at the C implementations from this question: How to detect UTF-8 in plain C?

function FileIsValidUTF8(FileName: string): boolean;
var
  Stream: TStream;
  CurrByte: byte;
  SurrogateCount: integer;
  i: integer;
begin
  if not FileExists(FileName) then
    Exit(False);
  Stream := TFileStream.Create(FileName, fmOpenRead);
  try
    if Stream.Size = 0 then
      Exit(True);

    repeat
      CurrByte := Stream.ReadByte();

      // ascii
      if CurrByte <= 127 then
        continue;

      // leading code units
      if (CurrByte and $e0 = $c0) then
        SurrogateCount := 1
      else if (CurrByte and $f0 = $e0) then
        SurrogateCount := 2
      else if (CurrByte and $f8 = $f0) then
        SurrogateCount := 3
      else
        Exit(False); // invalid code unit

      if Stream.Position + SurrogateCount > Stream.Size then
        Exit(False); // incomplete code point

      for i := 1 to SurrogateCount do
      begin
        // invalid continuation byte
        if Stream.ReadByte() and $c0 <> $80 then
          Exit(False);
      end;
    until Stream.Position >= Stream.Size;
  finally
    Stream.Free;
  end;
  Result := True;
end;
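A compact Python equivalent of the same lead-byte/continuation-byte state machine (my own sketch for cross-checking, not the answer's code; like the Delphi version it ignores overlong forms and surrogate ranges):

```python
def is_valid_utf8(data: bytes) -> bool:
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:                  # ASCII byte
            i += 1
            continue
        if b & 0xE0 == 0xC0:           # 2-byte lead 110xxxxx
            k = 1
        elif b & 0xF0 == 0xE0:         # 3-byte lead 1110xxxx
            k = 2
        elif b & 0xF8 == 0xF0:         # 4-byte lead 11110xxx
            k = 3
        else:
            return False               # stray continuation or invalid byte
        if i + k >= n:
            return False               # truncated sequence at end of data
        if any(data[i + j] & 0xC0 != 0x80 for j in range(1, k + 1)):
            return False               # bad continuation byte
        i += k + 1
    return True
```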
Gustavogustavus answered 10/4 at 8:42 Comment(0)
P
0

I have a utility that works on Windows but not on Linux. It's not perfect, but it covers more cases.

function DetectEncoding(const Bytes: TBytes): TEncoding;
const
  BOM_UTF8: array [0 .. 2] of Byte = ($EF, $BB, $BF);
  BOM_UTF16_LE: array [0 .. 1] of Byte = ($FF, $FE);
  BOM_UTF16_BE: array [0 .. 1] of Byte = ($FE, $FF);
  // Add more BOMs here if needed
begin
  Result := TEncoding.Default; // Default to ANSI if no BOM is found

  if Length(Bytes) >= 3 then
  begin
    if CompareMem(@Bytes[0], @BOM_UTF8[0], Length(BOM_UTF8)) then
      Exit(TEncoding.UTF8)

    else if CompareMem(@Bytes[0], @BOM_UTF16_LE[0], Length(BOM_UTF16_LE)) then
      Exit(TEncoding.Unicode) // UTF-16 LE

    else if CompareMem(@Bytes[0], @BOM_UTF16_BE[0], Length(BOM_UTF16_BE)) then
      Exit(TEncoding.BigEndianUnicode); // UTF-16 BE
    // Add more checks here if needed
  end
  else if Length(Bytes) >= 2 then
  begin
    if CompareMem(@Bytes[0], @BOM_UTF16_LE[0], Length(BOM_UTF16_LE)) then
      Exit(TEncoding.Unicode) // UTF-16 LE

    else if CompareMem(@Bytes[0], @BOM_UTF16_BE[0], Length(BOM_UTF16_BE)) then
      Exit(TEncoding.BigEndianUnicode); // UTF-16 BE

    // Add more checks here if needed
  end;
end;
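The same BOM table can be sketched compactly like this (codec names here are Python's, e.g. "utf-16-le", rather than Delphi TEncoding values, and "ansi" stands in for TEncoding.Default):

```python
BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
    # Add more BOMs here if needed
]

def detect_encoding(data: bytes) -> str:
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return "ansi"  # default when no BOM is found
```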
Parallelogram answered 10/4 at 13:47 Comment(0)
M
-1

Forget BOM and other advice. Here's what I found and keep for reference:

Valid UTF-8 has a specific binary format. If it's a single-byte UTF-8 character, then it is always of the form '0xxxxxxx', where 'x' is any binary digit. If it's a two-byte UTF-8 character, then it's always of the form '110xxxxx 10xxxxxx'.

Source.

By the way, you're mostly on your own. The knowledge of codepages, UTF etc. isn't that good in the West, so the quality of advice is similarly... questionable.

Mores answered 5/2, 2022 at 18:10 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.