TStringList splitting bugs
Asked Answered
S

4

31

Recently I've been informed by a reputable SO user, that TStringList has splitting bugs which would cause it to fail parsing CSV data. I haven't been informed about the nature of these bugs, and a search on the internet including Quality Central did not produce any results, so I'm asking. What are TStringList splitting bugs?

Please note, I'm not interested in unfounded opinion based answers.


What I know:

Not much... One is that, these bugs show up rarely with test data, but not so rarely in real world.

The other is, as stated, they prevent proper parsing of CSV. Thinking that it is difficult to reproduce the bugs with test data, I am (probably) seeking help from whom have tried using a string list as a CSV parser in production code.

Irrelevant problems:

I obtained the information on a 'Delphi-XE' tagged question, so failing parsing due to the "space character being considered as a delimiter" feature do not apply. Because the introduction of the StrictDelimiter property with Delphi 2006 resolved that. I, myself, am using Delphi 2007.

Also since the string list can only hold strings, it is only responsible for splitting fields. Any conversion difficulty involving field values (f.i. date, floating point numbers..) arising from locale differences etc. are not in scope.

Basic rules:

There's no standard specification for CSV. But there are basic rules inferred from various specifications.

Below is demonstration of how TStringList handles these. Rules and example strings are from Wikipedia. Brackets ([ ]) are superimposed around strings to be able to see leading or trailing spaces (where relevant) by the test code.


Spaces are considered part of a field and should not be ignored.

Test string: [1997, Ford , E350]
Items: [1997] [ Ford ] [ E350]


Fields with embedded commas must be enclosed within double-quote characters.

Test string: [1997,Ford,E350,"Super, luxurious truck"]
Items: [1997] [Ford] [E350] [Super, luxurious truck]


Fields with embedded double-quote characters must be enclosed within double-quote characters, and each of the embedded double-quote characters must be represented by a pair of double-quote characters.

Test string: [1997,Ford,E350,"Super, ""luxurious"" truck"]
Items: [1997] [Ford] [E350] [Super, "luxurious" truck]


Fields with embedded line breaks must be enclosed within double-quote characters.

Test string: [1997,Ford,E350,"Go get one now
they are going fast"]
Items: [1997] [Ford] [E350] [Go get one now
they are going fast]


In CSV implementations that trim leading or trailing spaces, fields with such spaces must be enclosed within double-quote characters.

Test string: [1997,Ford,E350," Super luxurious truck "]
Items: [1997] [Ford] [E350] [ Super luxurious truck ]


Fields may always be enclosed within double-quote characters, whether necessary or not.

Test string: ["1997","Ford","E350"]
Items: [1997] [Ford] [E350]



Testing code:

var
  SL: TStringList;
  rule: string;

  function GetItemsText: string;
  var
    i: Integer;
  begin
    for i := 0 to SL.Count - 1 do
      Result := Result + '[' + SL[i] + '] ';
  end;

  procedure Test(TestStr: string);
  begin
    SL.DelimitedText := TestStr;
    Writeln(rule + sLineBreak, 'Test string: [', TestStr + ']' + sLineBreak,
            'Items: ' + GetItemsText + sLineBreak);
  end;

begin
  SL := TStringList.Create;
  SL.Delimiter := ',';        // default, but ";" is used with some locales
  SL.QuoteChar := '"';        // default
  SL.StrictDelimiter := True; // required: strings are separated *only* by Delimiter

  rule := 'Spaces are considered part of a field and should not be ignored.';
  Test('1997, Ford , E350');

  rule := 'Fields with embedded commas must be enclosed within double-quote characters.';
  Test('1997,Ford,E350,"Super, luxurious truck"');

  rule := 'Fields with embedded double-quote characters must be enclosed within double-quote characters, and each of the embedded double-quote characters must be represented by a pair of double-quote characters.';
  Test('1997,Ford,E350,"Super, ""luxurious"" truck"');

  rule := 'Fields with embedded line breaks must be enclosed within double-quote characters.';
  Test('1997,Ford,E350,"Go get one now'#10#13'they are going fast"');

  rule := 'In CSV implementations that trim leading or trailing spaces, fields with such spaces must be enclosed within double-quote characters.';
  Test('1997,Ford,E350," Super luxurious truck "');

  rule := 'Fields may always be enclosed within double-quote characters, whether necessary or not.';
  Test('"1997","Ford","E350"');

  SL.Free;
end;



If you've read it all, the question was :), what are "TStringList splitting bugs?"

Saito answered 23/6, 2011 at 22:5 Comment(12)
+1, I have heard this before but never validated it.Phallus
Why not ask the reputable SO user who told you about it? (Who was it, anyway?)Cenac
I've never had any problems (other than lack of StrictDelimiter in early versions) after 13 years using TStringList. It may not do exactly what everyone thinks it should, and "fail" on input that don't match its specification, but is is internally consistent. Mind you the first subclass of RTL code I ever wrote was reimplementing CommaText to handle spaces in fields without quotes to handle poorly formed input.Bruns
@Mason - I did ask, but did not get a response. I don't know if it's Ok to tell who he is.Saito
@Gerry - I appreciate your input about "out of specification data", thanks!Saito
+1 for joining "luxurious", "truck" and "Ford" into the single entity :-)Brindabrindell
i think the majority of the issue here is the lack of a strict CSV format and the various flavours you will see in the wild. Also being a Delphi7 user - this is where TStringList does NOT work correctly and may have prompted the initial 'splitting bugs' comment from the reputable user.Iridic
The author of JvCSVDataset (Warren Postma) may be able to help you.Iridic
+1 Your "rules" scream to be converted into unit tests.Ymir
Including floating point values and cross-border exchange of csv files, does not make it easier. See [en.wikipedia.org/wiki/Comma-separated_values] for more input. Some "standard" seems to exist RFC4180 [tools.ietf.org/html/rfc4180]Tsaritsyn
@Mason, the original question is here: #6386236Litton
GREAT question. corrects a few bits of 2009 era (outdated information really) things that I had spouted off without checking if it still applied to 2010/XE.Flounder
C
13

Not much... One is that, these bugs show up rarely with test data, but not so rarely in real world.

All it takes is one case. Test data is not random data, one user with one failure case should submit the data and voilà, we've got a test case. If no one can provide test data, maybe there's no bug/failure?

There's no standard specification for CSV.

That one sure helps with the confusion. Without a standard specification, how do you prove something is wrong? If this is left to one's own intuition, you might get into all kinds of troubles. Here's some from my own happy interaction with government issued software; My application was supposed to export data in CSV format, and the government application was supposed to import it. Here's what got us into a lot of trouble several years in a row:

  • How do you represent empty data? Since there's no CSV standard, one year my friendly gov decided anything goes, including nothing (two consecutive commas). Next they decided only consecutive commas are OK, that is, Field,"",Field is not valid, should be Field,,Field. Had a lot of fun explaining to my customers that the gov app changed validation rules from one week to the next...
  • Do you export ZERO integer data? This was probably an bigger abuse, but my "gov app" decided to validate that also. At one time it was mandatory to include the 0, then it was mandatory NOT to include the 0. That is, at one time Field,0,Field was valid, next Field,,Field was the only valid way...

And here's an other test-case where (my) intuition failed:

1997, Ford, E350, "Super, luxurious truck"

Please note the space between , and "Super, and the very lucky comma that follows "Super. The parser employed by TStrings only sees the quote char if it immediately follows the delimiter. That string is parsed as:

[1997]
[ Ford]
[ E350]
[ "Super]
[ luxurious truck"]

Intuitively I'd expect:

[1997]
[ Ford]
[ E350]
[Super luxurious truck]

But guess what, Excel does it the same way Delphi does it...

Conclusion

  • TStrings.CommaText is fairly good and nicely implemented, at least the Delphi 2010 version I looked at is quite effective (avoids multiple string allocations, uses a PChar to "walk" the parsed string) and works about the same as Excel's parser does.
  • In the real world you'll need to exchange data with other software, written using other libraries (or no libraries at all), where people might have miss-interpreted some of the (missing?) rules of CSV. You'll have to adapt, and it'll probably not be a case of right-or-wrong but a case of "my clients need to import this crap". If that happens, you'll have to write your own parser, one that adapts to the requirements of the 3rd party app you'd be dealing with. Until that happens, you can safely use TStrings. And when it does happen, it might not be TString's fault!
Callista answered 24/6, 2011 at 8:11 Comment(2)
Cosmin, if we are to take basic rules as specification, I guess your test-case is out of spec. Since spaces should not be ignored (rule #1), the double-quote is embedded in the field, hence it violates rule #3 (it should be escaped by a double-quote and the field should be surrounded by double-quotes). Nevertheless your intuition makes sense and demonstrates potential confusion. For that, and for all the other good information, Thanks!Saito
I think Cosmin's point is that there are no "basic rules" in the real world - because there is no definitive CSV spec you can point to ;) Very good post, Cosmin.Tsana
K
4

I'm going to go out on a limb and say that the most common failure case is the embedded linebreak. I know most of the CSV parsing I do ignores that. I'll use 2 TStringLists, 1 for the file I'm parsing, the other for the current line. So I'll end up with code similar to the following:

procedure Foo;
var
    CSVFile, ALine: TStringList;
    s: string;

begin
    CSVFile := TStringList.Create;
    ALine := TStringList.Create;
    ALine.StrictDelimiter := True;
    CSVFile.LoadFromFile('C:\Path\To\File.csv');
    for s in CSVFile do begin
        ALine.CommaText := s;
        DoSomethingInteresting(ALine);
    end;
end;

Of course, since I'm not taking care to make sure that each line is "complete", I can potentially run into cases where the input contains a quoted linebreak in a field and I miss it.

Until I run into real world data where it's an issue, I'm not going to bother fixing it. :-P

Keenakeenan answered 24/6, 2011 at 12:58 Comment(4)
The Delphi 2010 parser handles line-brakes in quoted CSV just fine. Here's a PasteBin console application that shows this in actionCallista
@Cosmin: Read my code more carefully. If you read a file into a stringlist, then process it line by line, you're not going to catch the embedded linebreak.Keenakeenan
I find it unlikely that this would be a common failure case, you haven't run into real world data that this would be a problem after all.. ;) Besides, I believe any developer running into this would realize that it's not a failing on TStringList's part, as it's working on what's given to it..Saito
Delphi 2010/XE TStringList fixes a lot of CSV quoting issues that did NOT work when I was using them in 2009. That means, I have to stop saying that this particular aspect is broken in Delphi, since it now WORKS.Flounder
C
0

Another example... this TStringList.CommaText bug exists in Delphi 2009.

procedure TForm1.Button1Click(Sender: TObject);
var
  list : TStringList;
begin
  list := TStringList.Create();
  try
    list.CommaText := '"a""';
    Assert(list.Count = 1);
    Assert(list[0] = 'a');
    Assert(list.CommaText = 'a'); // FAILS -- actual value is "a""
  finally
    FreeAndNil(list);
  end;
end;

The TStringList.CommaText setter and related methods corrupt the memory of the string that holds the a item (its null terminator character is overwritten by a ").

Coeval answered 6/8, 2013 at 21:12 Comment(2)
As I see it your example does not comply with the rules that the RTL seems to follow. You have an embedded quote but it's not doubled (as it should be), the code is free to spit out as it sees fit.Saito
True, I fed invalid CSV to the RTL. But the intent of the RTL (based on looking at the source of TStringList) is to handle any input and only return valid output. My "bug" violated that, I think.Coeval
M
0

Already tried use TArray<String> split?

var
text: String;
arr: TArray<String>;
begin
text := '1997,Ford,E350';
arr := text.split([',']);

So arr would be:

arr[0] = 1997;
arr[1] = Ford;
arr[2] = E350;
Melodeemelodeon answered 11/8, 2018 at 2:0 Comment(1)
it belongs to TStringHelper from System.SysUtils, not to TArray.Hornbook

© 2022 - 2024 — McMap. All rights reserved.