TStringList behavior with non ANSI files

In my application, when I want to import a file, I use TStringList.

But when someone exports data from Excel, the file encoding is UCS-2 Little Endian, and TStringList can't read the data.

Is there any way to detect this situation, identify the text encoding, and warn the user that the text provided is not compatible?

Just to be clear, the user will provide only plain text (letters and numbers). For anything else, I must show the warning.

A Unicode file without a BOM is fine. (TStringList can read it!)
An ANSI file too. (TStringList can read it!)
Even Unicode with a BOM would be fine if there is a way to remove it. (TStringList can read it, but the BOM bytes show up as "ï", "»" and "¿" characters.)

Pernick answered 26/4, 2013 at 15:44 Comment(3)
@DavidHeffernan yes, Delphi 6. - Pernick
Possible duplicate of "Handling of Unicode Characters using Delphi 6". - Palladic
Jedi Code Library has Unicode-enabled string lists ready to be used. - Resolution

I used the following function in Delphi 6 to detect Unicode BOMs.

const
  //standard byte order marks (BOMs)
  UTF8BOM:              array [0..2] of AnsiChar = #$EF#$BB#$BF;
  UTF16LittleEndianBOM: array [0..1] of AnsiChar = #$FF#$FE;
  UTF16BigEndianBOM:    array [0..1] of AnsiChar = #$FE#$FF;
  UTF32LittleEndianBOM: array [0..3] of AnsiChar = #$FF#$FE#$00#$00;
  UTF32BigEndianBOM:    array [0..3] of AnsiChar = #$00#$00#$FE#$FF;

function FileHasUnicodeBOM(const FileName: string): Boolean;
var
  Buffer: array [0..3] of AnsiChar;
  Stream: TFileStream;
begin
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite); // Allow other programs read access at the same time.
  Try
    FillChar(Buffer, SizeOf(Buffer), $AA); // fill with bytes that we are not expecting, then...
    Stream.Read(Buffer, SizeOf(Buffer));   // ...read up to SizeOf(Buffer) bytes; the file may be shorter
    // use Read rather than ReadBuffer so that no exception is raised if we can't fill Buffer
  Finally
    FreeAndNil(Stream);
  End;
  Result := CompareMem(@UTF8BOM,              @Buffer, SizeOf(UTF8BOM))              or
            CompareMem(@UTF16LittleEndianBOM, @Buffer, SizeOf(UTF16LittleEndianBOM)) or
            CompareMem(@UTF16BigEndianBOM,    @Buffer, SizeOf(UTF16BigEndianBOM))    or
            CompareMem(@UTF32LittleEndianBOM, @Buffer, SizeOf(UTF32LittleEndianBOM)) or
            CompareMem(@UTF32BigEndianBOM,    @Buffer, SizeOf(UTF32BigEndianBOM));
end;

This will detect all the standard BOMs. You could use it to block such files if that's the behaviour you want.
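As a rough sketch of how the check might be wired into an import, assuming FileHasUnicodeBOM from above is in scope: the procedure name ImportTextFile and the wording of the message are made up for illustration, and the calling unit needs Classes, Dialogs and SysUtils in its uses clause.

```pascal
// Hypothetical import routine (name and message text are illustrative only).
// Assumes FileHasUnicodeBOM, as defined above, is visible here.
procedure ImportTextFile(const FileName: string; Lines: TStrings);
begin
  if FileHasUnicodeBOM(FileName) then
  begin
    // Refuse the file and tell the user why, instead of loading garbage.
    MessageDlg('The file "' + FileName + '" starts with a Unicode byte order mark. ' +
      'Please save it as ANSI text and try again.', mtWarning, [mbOK], 0);
    Exit;
  end;
  // No known BOM found: load it the usual way.
  Lines.LoadFromFile(FileName);
end;
```

Note that this only rejects files that carry a BOM. A Unicode file saved without one still gets through.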

You state that the Delphi 6 TStringList can load 16-bit encoded files if they do not have a BOM. While that may be the case, you will find that, for characters in the ASCII range, every other character is #0, which I guess is not what you want.
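To see where the #0 characters come from, here is a small console sketch that dumps the bytes of a two-character WideString, which Windows stores as UTF-16 little endian. The program name is arbitrary.

```pascal
program Utf16Bytes;
{$APPTYPE CONSOLE}
uses
  SysUtils;
var
  W: WideString;
  P: PAnsiChar;
  I: Integer;
begin
  W := 'AB';
  // View the UTF-16 data of the WideString as raw bytes.
  P := PAnsiChar(Pointer(W));
  for I := 0 to Length(W) * SizeOf(WideChar) - 1 do
    Write(IntToHex(Ord(P[I]), 2), ' ');
  WriteLn;
end.
```

This prints 41 00 42 00: each ASCII letter is followed by a zero byte, and that zero byte is what an ANSI reader sees as #0.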

If you want to detect that text is Unicode for files without BOMs then you could use IsTextUnicode. However, it may give false positives. This is a situation where I suspect it is better to ask for forgiveness than permission.
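A minimal sketch of calling IsTextUnicode on the start of a file might look like this. The names IsTextUnicodeApi and FileLooksLikeUtf16 and the 4 KB sample size are my own choices. The function is imported locally from advapi32.dll so the sketch does not depend on how your Windows.pas declares it, and the last parameter is declared as an untyped Pointer because only nil is passed.

```pascal
uses
  Windows, Classes, SysUtils;

// Local import of the Win32 IsTextUnicode function (exported by advapi32.dll).
function IsTextUnicodeApi(lpv: Pointer; iSize: Integer; lpiResult: Pointer): BOOL;
  stdcall; external 'advapi32.dll' name 'IsTextUnicode';

// Hypothetical helper: applies the heuristic to the first few KB of the file.
function FileLooksLikeUtf16(const FileName: string): Boolean;
var
  Buffer: array [0..4095] of Byte;
  BytesRead: Integer;
  Stream: TFileStream;
begin
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    // Read, not ReadBuffer: short files must not raise an exception.
    BytesRead := Stream.Read(Buffer, SizeOf(Buffer));
  finally
    Stream.Free;
  end;
  // Passing nil as the last argument asks the API to run all of its tests.
  Result := (BytesRead >= 2) and IsTextUnicodeApi(@Buffer, BytesRead, nil);
end;
```

Treat the result as a hint only. The test is statistical, so short or unusual ANSI files can be reported as Unicode and vice versa.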

Now, if I were you I would not actually try to block Unicode files. I would read them. Use the TNT Unicode library. The class you want is called TWideStringList.

Eubanks answered 26/4, 2013 at 16:1 Comment(16)
UCS-2 is an older format from early Unicode. It's a fixed-length 16-bit encoding. The first versions of Windows NT were based on UCS-2, as was early Java, I believe. But UTF-16 is a variable-length encoding. Some code points need more than one character element. Those use surrogate pairs. However, UCS-2 is pretty much a subset of UTF-16 as I understand it. If you read a UCS-2 file as if it was UTF-16 you'll get sensible data. - Eubanks
Yes, that would block any file that contained a known Unicode BOM. - Eubanks
"Detect Unicode" is a pretty bold claim; your code only detects BOM presence and is an open sesame for any file w/o a BOM. - Erastian
@user539484 Yeah, change the name to TextFileHasKnownUnicodeBOM. It is of course impossible to detect Unicode with total accuracy. - Eubanks
Yes, you should change the name; UTF-8 files w/o a BOM are very common. MS heuristic implementations such as IsTextUnicode or DetectInputCodepage are much more reliable. - Erastian
@MatheusFreitas It reads it, but it interprets it as ANSI rather than UTF-8. If you then pass that on to something that expects UTF-8 you'll be fine. Otherwise, if you pass it to something that expects ANSI, you'll have incorrect transmission for characters outside the ASCII range. - Eubanks
@Matheus Freitas, yes, TStrings can. If you handle misinterpreted characters elsewhere, it will fit your purpose. The function name and the claim to detect Unicode are still very wrong. - Erastian
@user539484 No matter how many times you tell me that the name is wrong, it will still be true. You should also take it up with MS since their function IsTextUnicode is also named incorrectly. ;-) - Eubanks
@Matheus Freitas, I have no idea what you are talking about. Apparently you are already happy with your answer, which boldly "detects Unicode" :-) - Erastian
@user539484 I think you are being very dismissive of Matheus. I'm sure he is perfectly capable of naming the function however he wishes. Do you really think so little of him that you believe he is only capable of pasting the code and not modifying it? You've made your point, and I completely agree with you. - Eubanks
@Matheus Freitas, I was being dismissive toward your really unspecific CLR reference. If you make a specific point, we can surely discuss it further and probably even compare with the implementations I mentioned above. David, my objection was targeted at the claim of Unicode text detection made easy, which was misleading (you can check a partial implementation in the WINE sources). - Erastian
@MatheusFreitas And the BOM detection is a cheap and simple way to do that, right? - Eubanks
@DavidHeffernan Now, I just want to leave this here for future reference. I believe, after some reading, I am now able to understand a little bit about Unicode. The function provided did not solve the problem because the user sent me files without a BOM. The solution: none. I actually told him to encode the files as ANSI (or UTF-8), since I'm porting the app to a Unicode version of Delphi. The function provided is indeed correct; it does what it says, I mean, it detects Unicode BOM presence. - Pernick
@DavidHeffernan I told the user to encode to UTF-8 (without BOM) as well, because all data sent to me is local names (it does not use Chinese characters, for example) and numbers. It never happened to cross ASCII limits. - Pernick
Generally speaking the only way to do this well is for both parties to agree on a format for interop. Trying to detect the format is prone to error. Detecting a BOM is fine so long as all parties agree to include a BOM. But if the BOM is not there then really the only 100% solution is for both parties to agree on an encoding. - Eubanks
@DavidHeffernan Yes, that is correct. The problem is that, a few months ago, I didn't even know Unicode existed. I am still learning about it. But I believe that agreement is a good idea, as you said. - Pernick
