Boyer-Moore Practical in C#?
Asked Answered
K

3

33

Boyer-Moore is probably the fastest non-indexed text-search algorithm known. So I'm implementing it in C# for my Black Belt Coder website.

I had it working and it showed roughly the expected performance improvements compared to String.IndexOf(). However, when I added the StringComparison.Ordinal argument to IndexOf, it started outperforming my Boyer-Moore implementation. Sometimes, by a considerable amount.

I wonder if anyone can help me figure out why. I understand why StringComparision.Ordinal might speed things up, but how could it be faster than Boyer-Moore? Is it because of the the overhead of the .NET platform itself, perhaps because array indexes must be validated to ensure they're in range, or something else altogether. Are some algorithms just not practical in C#.NET?

Below is the key code.

// Base for search classes
abstract class SearchBase
{
    public const int InvalidIndex = -1;
    protected string _pattern;
    public SearchBase(string pattern) { _pattern = pattern; }
    public abstract int Search(string text, int startIndex);
    public int Search(string text) { return Search(text, 0); }
}

/// <summary>
/// A simplified Boyer-Moore implementation.
/// 
/// Note: Uses a single skip array, which uses more memory than needed and
/// may not be large enough. Will be replaced with multi-stage table.
/// </summary>
class BoyerMoore2 : SearchBase
{
    private byte[] _skipArray;

    public BoyerMoore2(string pattern)
        : base(pattern)
    {
        // TODO: To be replaced with multi-stage table
        _skipArray = new byte[0x10000];

        for (int i = 0; i < _skipArray.Length; i++)
            _skipArray[i] = (byte)_pattern.Length;
        for (int i = 0; i < _pattern.Length - 1; i++)
            _skipArray[_pattern[i]] = (byte)(_pattern.Length - i - 1);
    }

    public override int Search(string text, int startIndex)
    {
        int i = startIndex;

        // Loop while there's still room for search term
        while (i <= (text.Length - _pattern.Length))
        {
            // Look if we have a match at this position
            int j = _pattern.Length - 1;
            while (j >= 0 && _pattern[j] == text[i + j])
                j--;

            if (j < 0)
            {
                // Match found
                return i;
            }

            // Advance to next comparision
            i += Math.Max(_skipArray[text[i + j]] - _pattern.Length + 1 + j, 1);
        }
        // No match found
        return InvalidIndex;
    }
}

EDIT: I've posted all my test code and conclusions on the matter at http://www.blackbeltcoder.com/Articles/algorithms/fast-text-search-with-boyer-moore.

Keshiakesia answered 5/2, 2011 at 2:15 Comment(10)
Jonathan, there's no such thing as "C#.NET".Minnick
Are you completely excluding the possibility that Boyer-Moore is employed in .net internally? Just curious.Ichthyic
See #2584669 and the comments under the accepted answer in particular.Haddad
@Nikita: No, and I spent that last 30 minutes trying to get .NET Reflector up after Red Gate has all but ruined it. However, it looks like it goes into a routine that .NET Reflector won't show. (Unmanaged code, perhaps?) But even if it did use BM, how could it be so much faster? I'm not even including the skip-array setup in my timing. Does unmanaged code offer that much performance benefit?Keshiakesia
@Hightechrider: Thanks for the reference, however it wasn't that much help. They basically speculate about IndexOf() using Boyer-Moore and talk about issues with Unicode (which I've already developed code to correctly handle).Keshiakesia
@John: How do you refer to C# running on .NET?Keshiakesia
FYI, .NET is Microsoft's implementation of C#, the CLR, and the various libraries. Mono is the other major implementation.Yamada
@Jonathan - It also explains that IndexOf uses native/internal code (which you can also see using Reflector). Since the framework resorts to native code you can bet that they have a faster implementation than you do in your managed code.Haddad
@Hightechrider: Yes, this seems to be the likely explanation. I spent a little time with Reflector. Although I don't understand it that well, I followed it down until it seemed to go to unmanaged code. It possibly uses hand-optimized assembly language, and StringComparison.Ordinal allows it to perform a direct comparision.Keshiakesia
note that performance really matters you can use unsafe code and work using a char*. Some internal string routines do that.Bashful
K
22

Based on my own tests and the comments made here, I've concluded that the reason String.IndexOf() performs so well with StringComparision.Ordinal is because the method calls into unmanaged code that likely employs hand-optimized assembly language.

I have run a number of different tests and String.IndexOf() just seems to be faster than anything I can implement using managed C# code.

If anyone's interested, I've written everything I've discovered about this and posted several variations of the Boyer-Moore algorithm in C# at http://www.blackbeltcoder.com/Articles/algorithms/fast-text-search-with-boyer-moore.

Keshiakesia answered 6/2, 2011 at 21:43 Comment(4)
For others who may find it is useful, String.Contains() calls String.IndexOf(value, StringComparison.Ordinal) >= 0 . So the conclution is to use String.Conatinst() for string searchChanda
This conclusion is data dependent. Boyer Moore can be arbitrarily faster on suitable data.Berkeleian
@usr: Well, if you have evidence that it is faster in some case, please feel free to present that evidence. I have a thorough understanding of Boyer-Moore, and fully understand how it is faster for certain data, but I tested a variety of data and String.IndexOf() seemed faster pretty much consistently.Keshiakesia
@ValidfroM: string.Contains() does not tell you where the string is located. So it wouldn't be a good choice at all for string search.Keshiakesia
I
4

My bet is that setting that flag allows String.IndexOf to use Boyer-Moore itself. And its implementation is better than yours.

Without that flag it has to be careful using Boyer-Moore (and probably doesn't) because of potential issues around Unicode. In particular the possibility of Unicode causes the transition tables that Boyer-Moore uses to blow up.

Ihram answered 5/2, 2011 at 2:27 Comment(10)
Well, that would be a neat trick given that I'm using some pretty standard implementations. (They aren't mine, BTW.) However, if it runs in unmanaged code, that may offer some performance benefits.Keshiakesia
If the BCL is using a Boyer-Moore search, it should be detectable; searching for a long string ("abcdefghijklmnopqrstuvwxyz") should be measurably faster than searching for a short string ("a").Yamada
@Gabe: Good point, but it doesn't seem to be. It just seems fast no matter what. My Boyer-Moore routines, on the other hand, are definitely affected by the length of the search pattern.Keshiakesia
Hmmm, could it be .NET uses different algorithms for different cases?Kilimanjaro
It's quite likely that the .Net implementation is mostly just a single instruction (REP CMPSB), making it go as fast as data can flow through the CPU.Yamada
@Gabe: This is the type of stuff I've been wondering about. It seems likely that IndexOf() uses some optimizations that are not available to me as a C# programmer. It makes sense that CMPSW can only be used with StringComparison.Ordinal, which would explain why it's considerably slower in other comparison modes.Keshiakesia
There's no reason for you to have to bet here. Has anyone opened it up in Reflector and checked?Wintergreen
Actually, @billy-oneal, there is a good reason for me to have to bet. I don't use Windows or C# so testing directly is impractical for me. But I know about Boyer-Moore and its potential pitfalls from other languages in other environments.Ihram
@Yamada the rep instructions are by no means the fastest instructions.Idalia
@Mehrdad: Feel free to substitute in whatever .Net actually uses.Yamada
B
0

I made some small changes to your code, and made a different implementation to the Boyer-Moore algorithm and got better results. I got the idea for this implementation from here

But to be honest, I would expect to reach a higher speed compared to IndexOf.

enter image description here

class SearchResults
{
    public int Matches { get; set; }
    public long Ticks { get; set; }
}

abstract class SearchBase
{
    public const int InvalidIndex = -1;
    protected string _pattern;
    protected string _text;
    public SearchBase(string text, string pattern) { _text = text; _pattern = pattern; }
    public abstract int Search(int startIndex);
}

internal class BoyerMoore3 : SearchBase
{
    readonly byte[] textBytes;
    readonly byte[] patternBytes;
    readonly int valueLength;
    readonly int patternLength;
    private readonly int[] badCharacters = new int[256];
    private readonly int lastPatternByte;

    public BoyerMoore3(string text, string pattern) : base(text, pattern)
    {
        textBytes = Encoding.UTF8.GetBytes(text);
        patternBytes = Encoding.UTF8.GetBytes(pattern);
        valueLength = textBytes.Length;
        patternLength = patternBytes.Length;

        for (int i = 0; i < 256; ++i)
            badCharacters[i] = patternLength;

        lastPatternByte = patternLength - 1;

        for (int i = 0; i < lastPatternByte; ++i)
            badCharacters[patternBytes[i]] = lastPatternByte - i;
    }

    public override int Search(int startIndex)
    {
        int index = startIndex;

        while (index <= (valueLength - patternLength))
        {
            for (int i = lastPatternByte; textBytes[index + i] == patternBytes[i]; --i)
            {
                if (i == 0)
                    return index;
            }

            index += badCharacters[textBytes[index + lastPatternByte]];
        }

        // Text not found
        return InvalidIndex;
    }
}

Changed code from Form1:

    private void RunSearch(string pattern, SearchBase search, SearchResults results)
    {
        var timer = new Stopwatch();

        // Start timer
        timer.Start();

        // Find all matches
        int pos = search.Search(0);
        while (pos != -1)
        {
            results.Matches++;
            pos = search.Search(pos + pattern.Length);
        }

        // Stop timer
        timer.Stop();

        // Add to total Ticks
        results.Ticks += timer.ElapsedTicks;
    }
Brickbat answered 20/12, 2020 at 12:39 Comment(3)
It looks like you only have 256 entries in your skip table. That's enough for ASCII, but not Unicode.Keshiakesia
@JonathanWood Did you notice that the string is converted to a byte array by UTF8?Brickbat
@JonathanWood and see hereBrickbat

© 2022 - 2024 — McMap. All rights reserved.