Is there a Boyer-Moore string search and fast search and replace function and fast string count for Delphi 2010 String (UnicodeString) out there?
Asked Answered
L

2

19

I need three fast-on-large-strings functions: fast search, fast search and replace, and fast count of substrings in a string.

I have run into Boyer-Moore string searches in C++ and Python, but the only Delphi Boyer-Moore algorithm used to implement fast search and replace that I have found is part of the FastStrings by Peter Morris, formerly of DroopyEyes software, and his website and email are no longer working.

I have already ported FastStrings forward to work great for AnsiStrings in Delphi 2009/2010, where a byte is equal to one AnsiChar, but making them also work with the String (UnicodeString) in Delphi 2010 appears non-trivial.

Using this Boyer-Moore algorithm, it should be possible to easily do case insensitive searches, as well as case-insensitive search and replace, without any temporary string (using StrUpper etc), and without calling Pos() which is slower than Boyer-Moore searching when repeated searches over the same text are required.

(Edit: I have a partial solution, written as an answer to this question, it is almost 100% complete, it even has a fast string replace function. I believe it MUST have bugs, and especially think that since it pretends to be Unicode capable that it must be that there are glitches due to unfulfilled Unicode promises. )

(Edit2: Interesting and unexpected result; The large stack size of a unicode code-point table on the stack - SkipTable in the code below puts a serious damper on the amount of win-win-optimization you can do here in a unicode string boyer-moore string search. Thanks to Florent Ouchet for pointing out what I should have noticed immediately.)

Lx answered 22/7, 2010 at 16:3 Comment(0)
L
12

This answer is now complete and works for case sensitive mode, but does not work for case insensitive mode, and probably has other bugs too, since it's not well unit tested, and could probably be optimized further, for example I repeated the local function __SameChar instead of using a comparison function callback which would have been faster, and actually, allowing the user to pass in a comparison function for all these would be great for Unicode users who want to provide some extra logic (equivalent sets of Unicode glyphs for some languages).

Based on Dorin Dominica's code, I built the following.

{ _FindStringBoyer:
  Boyer-Moore search algorith using regular String instead of AnsiSTring, and no ASM.
  Credited to Dorin Duminica.
}
function _FindStringBoyer(const sString, sPattern: string;
  const bCaseSensitive: Boolean = True; const fromPos: Integer = 1): Integer;

    function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
    begin
      if bCaseSensitive then
        Result := (sString[StringIndex] = sPattern[PatternIndex])
      else
        Result := (CompareText(sString[StringIndex], sPattern[PatternIndex]) = 0);
    end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;

var
  SkipTable: array [Char] of Integer;
  LengthPattern: Integer;
  LengthString: Integer;
  Index: Integer;
  kIndex: Integer;
  LastMarker: Integer;
  Large: Integer;
  chPattern: Char;
begin
  if fromPos < 1 then
    raise Exception.CreateFmt('Invalid search start position: %d.', [fromPos]);
  LengthPattern := Length(sPattern);
  LengthString := Length(sString);
  for chPattern := Low(Char) to High(Char) do
    SkipTable[chPattern] := LengthPattern;
  for Index := 1 to LengthPattern -1 do
    SkipTable[sPattern[Index]] := LengthPattern - Index;
  Large := LengthPattern + LengthString + 1;
  LastMarker := SkipTable[sPattern[LengthPattern]];
  SkipTable[sPattern[LengthPattern]] := Large;
  Index := fromPos + LengthPattern -1;
  Result := 0;
  while Index <= LengthString do begin
    repeat
      Index := Index + SkipTable[sString[Index]];
    until Index > LengthString;
    if Index <= Large then
      Break
    else
      Index := Index - Large;
    kIndex := 1;
    while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
      Inc(kIndex);
    if kIndex = LengthPattern then begin
      // Found, return.
      Result := Index - kIndex + 1;
      Index := Index + LengthPattern;
      exit;
    end else begin
      if __SameChar(Index, LengthPattern) then
        Index := Index + LastMarker
      else
        Index := Index + SkipTable[sString[Index]];
    end; // if kIndex = LengthPattern then begin
  end; // while Index <= LengthString do begin
end;

{ Written by Warren, using the above code as a starter, we calculate the SkipTable once, and then count the number of instances of
  a substring inside the main string, at a much faster rate than we
  could have done otherwise.  Another thing that would be great is
  to have a function that returns an array of find-locations,
  which would be way faster to do than repeatedly calling Pos.
}
function _StringCountBoyer(const aSourceString, aFindString : String; Const CaseSensitive : Boolean = TRUE) : Integer;
var
  foundPos:Integer;
  fromPos:Integer;
  Limit:Integer;
  guard:Integer;
  SkipTable: array [Char] of Integer;
  LengthPattern: Integer;
  LengthString: Integer;
  Index: Integer;
  kIndex: Integer;
  LastMarker: Integer;
  Large: Integer;
  chPattern: Char;
    function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
    begin
      if CaseSensitive then
        Result := (aSourceString[StringIndex] = aFindString[PatternIndex])
      else
        Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0);
    end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;

begin
  result := 0;
  foundPos := 1;
  fromPos := 1;
  Limit := Length(aSourceString);
  guard := Length(aFindString);
  Index := 0;
  LengthPattern := Length(aFindString);
  LengthString := Length(aSourceString);
  for chPattern := Low(Char) to High(Char) do
    SkipTable[chPattern] := LengthPattern;
  for Index := 1 to LengthPattern -1 do
    SkipTable[aFindString[Index]] := LengthPattern - Index;
  Large := LengthPattern + LengthString + 1;
  LastMarker := SkipTable[aFindString[LengthPattern]];
  SkipTable[aFindString[LengthPattern]] := Large;
  while (foundPos>=1) and (fromPos < Limit) and (Index<Limit) do begin

    Index := fromPos + LengthPattern -1;
    if Index>Limit then
        break;
    kIndex := 0;
    while Index <= LengthString do begin
      repeat
        Index := Index + SkipTable[aSourceString[Index]];
      until Index > LengthString;
      if Index <= Large then
        Break
      else
        Index := Index - Large;
      kIndex := 1;
      while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
        Inc(kIndex);
      if kIndex = LengthPattern then begin
        // Found, return.
        //Result := Index - kIndex + 1;
        Index := Index + LengthPattern;
        fromPos := Index;
        Inc(Result);
        break;
      end else begin
        if __SameChar(Index, LengthPattern) then
          Index := Index + LastMarker
        else
          Index := Index + SkipTable[aSourceString[Index]];
      end; // if kIndex = LengthPattern then begin
    end; // while Index <= LengthString do begin

  end;
end; 

This is really a nice Algorithm, because:

  • it's way faster to count instances of substring X in string Y this way, magnificently so.
  • For merely replacing Pos() the _FindStringBoyer() is faster than the pure asm version of Pos() contributed to Delphi by FastCode project people, that is currently used for Pos, and if you need the case-insensitivity, you can imagine the performance boost when we don't have to call UpperCase on a 100 megabyte string. (Okay, your strings aren't going to be THAT big. But still, Efficient Algorithms are a Thing of Beauty.)

Okay I wrote a String Replace in Boyer-Moore style:

function _StringReplaceBoyer(const aSourceString, aFindString,aReplaceString : String; Flags: TReplaceFlags) : String;
var
  errors:Integer;
  fromPos:Integer;
  Limit:Integer;
  guard:Integer;
  SkipTable: array [Char] of Integer;
  LengthPattern: Integer;
  LengthString: Integer;
  Index: Integer;
  kIndex: Integer;
  LastMarker: Integer;
  Large: Integer;
  chPattern: Char;
  CaseSensitive:Boolean;
  foundAt:Integer;
  lastFoundAt:Integer;
  copyStartsAt:Integer;
  copyLen:Integer;
    function __SameChar(StringIndex, PatternIndex: Integer): Boolean;
    begin
      if CaseSensitive then
        Result := (aSourceString[StringIndex] = aFindString[PatternIndex])
      else
        Result := (CompareText(aSourceString[StringIndex], aFindString[PatternIndex]) = 0);
    end; // function __SameChar(StringIndex, PatternIndex: Integer): Boolean;

begin
  result := '';
  lastFoundAt := 0;
  fromPos := 1;
  errors := 0;
  CaseSensitive := rfIgnoreCase in Flags;
  Limit := Length(aSourceString);
  guard := Length(aFindString);
  Index := 0;
  LengthPattern := Length(aFindString);
  LengthString := Length(aSourceString);
  for chPattern := Low(Char) to High(Char) do
    SkipTable[chPattern] := LengthPattern;
  for Index := 1 to LengthPattern -1 do
    SkipTable[aFindString[Index]] := LengthPattern - Index;
  Large := LengthPattern + LengthString + 1;
  LastMarker := SkipTable[aFindString[LengthPattern]];
  SkipTable[aFindString[LengthPattern]] := Large;
  while (fromPos>=1) and (fromPos <= Limit) and (Index<=Limit) do begin

    Index := fromPos + LengthPattern -1;
    if Index>Limit then
        break;
    kIndex := 0;
    foundAt := 0;
    while Index <= LengthString do begin
      repeat
        Index := Index + SkipTable[aSourceString[Index]];
      until Index > LengthString;
      if Index <= Large then
        Break
      else
        Index := Index - Large;
      kIndex := 1;
      while (kIndex < LengthPattern) and __SameChar(Index - kIndex, LengthPattern - kIndex) do
        Inc(kIndex);
      if kIndex = LengthPattern then begin


        foundAt := Index - kIndex + 1;
        Index := Index + LengthPattern;
        //fromPos := Index;
        fromPos := (foundAt+LengthPattern);
        if lastFoundAt=0 then begin
                copyStartsAt := 1;
                copyLen := foundAt-copyStartsAt;
        end else begin
                copyStartsAt := lastFoundAt+LengthPattern;
                copyLen := foundAt-copyStartsAt;
        end;

        if (copyLen<=0)or(copyStartsAt<=0) then begin
                Inc(errors);
        end;

        Result := Result + Copy(aSourceString, copyStartsAt, copyLen ) + aReplaceString;
        lastFoundAt := foundAt;
        if not (rfReplaceAll in Flags) then
                 fromPos := 0; // break out of outer while loop too!
        break;
      end else begin
        if __SameChar(Index, LengthPattern) then
          Index := Index + LastMarker
        else
          Index := Index + SkipTable[aSourceString[Index]];
      end; // if kIndex = LengthPattern then begin
    end; // while Index <= LengthString do begin
  end;
  if (lastFoundAt=0) then
  begin
     // nothing was found, just return whole original string
      Result := aSourceString;
  end
  else
  if (lastFoundAt+LengthPattern < Limit) then begin
     // the part that didn't require any replacing, because nothing more was found,
     // or rfReplaceAll flag was not specified, is copied at the
     // end as the final step.
    copyStartsAt := lastFoundAt+LengthPattern;
    copyLen := Limit; { this number can be larger than needed to be, and it is harmless }
    Result := Result + Copy(aSourceString, copyStartsAt, copyLen );
  end;

end;

Okay, problem: Stack footprint of this:

var
  skiptable : array [Char] of Integer;  // 65536*4 bytes stack usage on Unicode delphi

Goodbye CPU hell, Hello stack hell. If I go for a dynamic array, then I have to resize it at runtime. So this thing is basically fast, because the Virtual Memory system on your computer doesn't blink at 256K going on the stack, but this is not always an optimal piece of code. Nevertheless my PC doesn't blink at big stack stuff like this. It's not going to become a Delphi standard library default or win any fastcode challenge in the future, with that kinda footprint. I think that repeated searches are a case where the above code should be written as a class, and the skiptable should be a data field in that class. Then you can build the boyer-moore table once, and over time, if the string is invariant, repeatedly use that object to do fast lookups.

Lx answered 22/7, 2010 at 16:5 Comment(22)
In my testing I found that the relative speed of fastcode Pos vs. BM versions depends on quite a few parameters (lengths, alphabet sizes, repetitiveness, substring location) so that it's not easy to generalise; some asm/SIMD-based implementations that are typically significantly faster than either of the above are possible, but I don't have a unicode-ready version at hand. And yes, a 100 Mb string isn't so unlikely when you're dealing with genomes. But then, at this stage people start to use different algorithms altogether.Waistline
Yep. But it would be nice if people could use efficient algorithms that are already well tested. Something like the above, when complete and tested, would be good to put into the JclStrings unit of the JCL library. Can you specify WHEN the fastcode Pos will be faster than a BoyerMoore algorithm? My tests are focusing on strings > 64K in length ONLY.Lx
Hand-optimized assembler versions of a boyer-moore type string replace that I have here are doing AnsiString search and replace about 30% faster on very large AnsiStrings. That it's THAT close is actually great, considering how much more "memory bandwidth" a giant UnicodeString needs.Lx
Alas no, my tests really don't constitute a sufficient sample of the parameter space for me to hazard a general prediction. What I can tell you is that for searches of random DNA substrings (~10-1000 char range) in a large random DNA string (Mb-Gb range) with randomised substring locations, FastCode Pos() tends to beat Boyer-Moore in the majority of cases. I got similar patterns with larger alphabet sizes also (e.g. 20-char alphabets or English language text).Waistline
Oh good. So Boyer-Moore could be a "pessimization" for general everyday use. Which fits with my general mental model of the FastCode project people as "bright sparks" who wouldn't miss a trick like this, if it was always a win-win algorithm. Anybody who wants the FastStrings code tweaked-to-run-in-D2010, and the above code in a zip format, just email me.Lx
I have several working SSE##-based SIMD asm brute-force search versions that beat both BM and Pos() easily for my purposes; on new CPUs, SSE4.2-based routines utilising PCMPxSTRx (which can handle 2-byte characters) are a distinct option. Maybe something for another fastcode challenge?Waistline
For large scale bioinformatics work where lots of short strings need to be repeatedly mapped to a larger reference string, people use e.g. suffix-array-based algorithms, often with Burrows-Wheeler transforms, which is great for both speed and memory use.Waistline
Wow Dude! Care to provide a sample SIMD implementation? :-)Lx
If you started checking out Delphi's StringReplace versus the above, you'd see there is probably no question that Delphi's StringReplace needs to be nuked and replaced with something smarter.Lx
@Warren P, I will have to go through the code to get it up to a level for it to be fit for sharing, and will write more thorough validations. But once that's done, I'll be happy to share it. I only have AnsiString versions, though.Waistline
I know this is pretty old, but thank you for writing the only Boyer-Moore implementation that I can find in Delphi that actually works, +1Seriema
Beware the high stack memory usage that can occur if you run this against some kinds of text.... (Imagine a chinese to english dictionary with 20,000 glyphs).Lx
Re the stack usage: Can you allocate it on the heap, once, the first time one of these methods is used, and free it when the program quits? Or is part of the speed dependent on it being located in the stack specifically?Phenica
This is really too much for the stack (default stack size is 1MB so a quarter of that is way too much). I would suggest to add a parameter where one may provide the memory or if nil the function does a getmem, freemem (so without memory initialization).Glisten
On modern windows systems, I really doubt you could even measure the effect of moving from stack to heap. If anything, it would be microscopically slower. Most machines have 8+ gigabytes of ram.Lx
Or pherhaps that's an instance where a threadvar would make sense.Glisten
The code in this answer has several errors. I had updated the answer with a first bug fix (28 aug), but I am currently tracing other bugs. I have removed my update again and will post a complety new answer if I have tested it enough.Pleadings
I have not succeeded in fixing the bug in the time I am willing to spend. The function fails with case insensitive search. I will add a warning to the end of the answer.Pleadings
The original question had no intention to support the case insensitive approach, and you retroactively adding it in is a mis-use of Stack overflow. I suggest you ask a new question if you want a CASE INSENSITIVE Boyer Moore. As far as I know, there is none. You just uppercase your inputs and uppercase your store. Don't amend a 4 year old question unless you wish to repair the code inline. Please try your edits again.Lx
I integrated your main fix into the main body of the answer Jan. Do not edit this code to add case insensitivity.Lx
I tested the above code in Delphi XE4 with ShowMessage(_StringCountBoyer('Hello...Hello..Hello..Hello', 'Hello').ToString()); and result is 2. Is the code working?Assessment
The case insensitive code above is indeed broken. Searches for strings like "Mike" "mike" "MIKE" yield different results as the skip table calcs do not factor in case insensitivity. I did find a solution here delphigeist.com/2010/04/… that appears to work in my limited testing.Odelet
G
2

Since I was just looking for the same: Jedi JCL has got a unicode aware search engine using Boyer-Moore in jclUnicode.pas. I have no idea how good or how fast it is yet.

Gazpacho answered 14/9, 2014 at 17:40 Comment(2)
I'd be curious to know if it uses a giant stack-based table or if it consumes mostly heap memory in operation.Lx
I haven't analyzed it really, but by a cursory look I'd say it uses the heap.Gazpacho

© 2022 - 2024 — McMap. All rights reserved.