String.Split works strange when last value is empty
Asked Answered
P

2

9

I'd like to split my string to array but it works bad when last "value" is empty. See my example please. Is it bug or feature? Is there any way how to use this function without workarounds?

var
  arr: TArray<string>;

  arr:='a;b;c'.Split([';']); //length of array = 3, it's OK
  arr:='a;b;c;'.Split([';']); //length of array = 3, but I expect 4
  arr:='a;b;;c'.Split([';']); //length of array = 4 since empty value is inside
  arr:=('a;b;c;'+' ').Split([';']); //length of array = 4 (primitive workaround with space)
Puttee answered 9/2, 2015 at 13:27 Comment(6)
That's just how it was designed. If you don't like it, write your own split function.Kepner
What happens with ';x'? Do you get one value or two? If you get two then the design is asymmetrical, a bad thing.Kepner
Strange, but it returns 2.Retroact
I experienced a similar problem with TStrings.CommaText in D5 and had to work around it. I noticed that was fixed sometime before D2007 though.Fourgon
System.StrUtils.SplitString is implemented the way you want it to be.Aalst
Bug report filed, quality.embarcadero.com/browse/RSP-11302 "TStringHelper.Split disfunctional".Glooming
K
8

This behaviour can't be changed. There's no way for you to customise how this split function works. I suspect that you'll need to provide your own split implementation. Michael Erikkson helpfully points out in a comment that System.StrUtils.SplitString behaves in the manner that you desire.

The design seems to me to be poor. For instance

Length('a;'.Split([';'])) = 1

and yet

Length(';a'.Split([';'])) = 2

This asymmetry is a clear indication of poor design. It's astonishing that testing did not identify this.

The fact that the design is so clearly suspect means that it may be worth submitting a bug report. I'd expect it to be denied since any change would impact existing code. But you never know.

My recommendations:

  1. Use your own split implementation that performs as you require.
  2. Submit a bug report.

Whilst System.StrUtils.SplitString does what you want, its performance is not great. That very likely does not matter. In which case you should use it. However, if performance matters, then I offer this:

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Diagnostics, System.StrUtils;

function MySplit(const s: string; Separator: char): TArray<string>;
var
  i, ItemIndex: Integer;
  len: Integer;
  SeparatorCount: Integer;
  Start: Integer;
begin
  len := Length(s);
  if len=0 then begin
    Result := nil;
    exit;
  end;

  SeparatorCount := 0;
  for i := 1 to len do begin
    if s[i]=Separator then begin
      inc(SeparatorCount);
    end;
  end;

  SetLength(Result, SeparatorCount+1);
  ItemIndex := 0;
  Start := 1;
  for i := 1 to len do begin
    if s[i]=Separator then begin
      Result[ItemIndex] := Copy(s, Start, i-Start);
      inc(ItemIndex);
      Start := i+1;
    end;
  end;
  Result[ItemIndex] := Copy(s, Start, len-Start+1);
end;

const
  InputString = 'asdkjhasd,we1324,wqweqw,qweqlkjh,asdqwe,qweqwe,asdasdqw';

var
  i: Integer;
  Stopwatch: TStopwatch;

const
  Count = 3000000;

begin
  Stopwatch := TStopwatch.StartNew;
  for i := 1 to Count do begin
    InputString.Split([',']);
  end;
  Writeln('string.Split: ', Stopwatch.ElapsedMilliseconds);

  Stopwatch := TStopwatch.StartNew;
  for i := 1 to Count do begin
    System.StrUtils.SplitString(InputString, ',');
  end;
  Writeln('StrUtils.SplitString: ', Stopwatch.ElapsedMilliseconds);

  Stopwatch := TStopwatch.StartNew;
  for i := 1 to Count do begin
    MySplit(InputString, ',');
  end;
  Writeln('MySplit: ', Stopwatch.ElapsedMilliseconds);
end.

The output of a 32 bit release build with XE7 on my E5530 is:

string.Split: 2798
StrUtils.SplitString: 7167
MySplit: 1428
Kepner answered 9/2, 2015 at 14:8 Comment(3)
This behaviour is mostly based on too much thinking what should have the programmer in mind, instead of building it on a fixed logical rule: "The separator separates two values" and everbody should now, that the result array will contain separater count plus one values.Implicate
@SirRufo While I agree wholeheartedly with the sentiment, I don't think the behaviour is intentional. In D5 TStrings.CommaText has a similar problem. When I looked at the code it was a simple bug: If a , appeared as the last char the input string, it would: read the character, so next Char is #0 (terminator for PChar) and start next iteration of the loop. But NextChar = #0 was the terminating condition for the loop, so that would end the loop. By D2007 this had been fixed with extra code to explicitly add an empty string in this case.Fourgon
@CraigYoung It is a bug indeed, but I was talking about how this buggy implementations are mostly caused by. Someone implement it and I hope will test it (so do I). But I guess there is no UnitTest and/or no description of the whole behaviour. Sometimes I think the UnitTest is simply done by: "It compiles!" - That leads me to build some (not too many so far but growing) UnitTest for the RTL.Implicate
A
2

The following is very similar to the accepted answer but i) it is a helper method and ii) it accepts an array of separators.

The method takes about 30% longer than David's for these reasons, but may be useful anyway.

program ImprovedSplit;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

type
  TStringHelperEx = record helper for string
  public
    function SplitEx(const Separator: array of Char): TArray<string>;
  end;

var
  TestString : string;
  StringArray : TArray<String>;


{ TStringHelperEx }

function TStringHelperEx.SplitEx( const Separator: array of Char ): TArray<string>;
var
  Str : string;
  Buf, Token : PChar;
  i, cnt : integer;
  sep : Char;
begin
  cnt := 0;
  Str := Self;
  Buf := @Str[1];
  SetLength(Result, 0);

  if Assigned(Buf) then begin

    for sep in Separator do begin
      for i := 0 to Length(Self) do begin
        if Buf[i] = sep then begin
          Buf[i] := #0;
          inc(cnt);
        end;
      end;
    end;

    SetLength(Result, cnt + 1);

    Token := Buf;
    for i := 0 to cnt do begin
      Result[i] := StrPas(Token);
      Token := Token + Length(Token) + 1;
    end;

  end;
end;

begin
  try
    TestString := '';
    StringArray := TestString.SplitEx([';']);
    Assert(Length(StringArray) = 0, 'Failed test for Empty String');

    TestString := 'a';
    StringArray := TestString.SplitEx([';']);
    Assert(Length(StringArray) = 1, 'Failed test for Single String');

    TestString := ';';
    StringArray := TestString.SplitEx([';']);
    Assert(Length(StringArray) = 2, 'Failed test for Single Separator');

    TestString := 'a;';
    StringArray := TestString.SplitEx([';']);
    Assert(Length(StringArray) = 2, 'Failed test for Single String + Single End-Separator');

    TestString := ';a';
    StringArray := TestString.SplitEx([';']);
    Assert(Length(StringArray) = 2, 'Failed test for Single String + Single Start-Separator');

    TestString := 'a;b;c';
    StringArray := TestString.SplitEx([';']);
    Assert(Length(StringArray) = 3, 'Failed test for Simple Case');

    TestString := ';a;b;c;';
    StringArray := TestString.SplitEx([';']);
    Assert(Length(StringArray) = 5, 'Failed test for Start and End Separator');

    TestString := '0;1;2;3;4;5;6;7;8;9;0;1;2;3;4;5;6;7;8;9;0;1;2;3;4;5;6;7;8;9;0;1;2;3;4;5;6;7;8;9';
    StringArray := TestString.SplitEx([';', ',']);
    Assert(Length(StringArray) = 40, 'Failed test for Larger Array');

    TestString := '0;1;2;3;4;5;6;7;8;9;0;1;2;3;4;5;6;7;8;9;0,1,2,3,4,5,6,7,8,9,0;1;2;3;4;5;6;7;8;9';
    StringArray := TestString.SplitEx([';', ',']);
    Assert(Length(StringArray) = 40, 'Failed test for Array of Separators');

    Writeln('No Errors');

  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;

  Writeln('Press ENTER to continue');
  Readln(TestString);

end.
Astrahan answered 9/2, 2015 at 18:6 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.