strstr faster than algorithms?

Asked 28/9, 2011 at 17:13 Answered 22/10, 2011 at 6:21

Solved c performance algorithm string-matching strstr

I have a file that's 21056 bytes.

I've written a program in C that reads the entire file into a buffer, and then uses multiple search algorithms to search the file for a token that's 82 chars.

I've used all the implementations of the algorithms from the “Exact String Matching Algorithms” page. I've used: KMP, BM, TBM, and Horspool. And then I used strstr and benchmarked each one.

What I'm wondering is, each time the strstr outperforms all the other algorithms. The only one that is faster sometimes is BM.

Shouldn't strstr be the slowest?

Here's my benchmark code with an example of benchmarking BM:

double get_time()
{
    LARGE_INTEGER t, f;
    QueryPerformanceCounter(&t);
    QueryPerformanceFrequency(&f);
    return (double)t.QuadPart/(double)f.QuadPart;
}

before = get_time();
BM(token, strlen(token), buffer, len);
after = get_time();
printf("Time: %f\n\n", after - before);

Could someone explain to me why strstr is outperforming the other search algorithms? I'll post more code on request if needed.

Zanazander answered 28/9, 2011 at 17:13 Comment(2)

The error is in your code. At least Horspool should consistently outperform strstr – KMP may actually be slower. But since you didn’t post your code, we can’t help you. That said, you can chose your data deliberately to make the naive search win so the choice of input data is also relevant. – Tristan 28/9, 2011 at 17:15

Are you seriously benchmarking searching for an 80 byte string in a 20k string?... Your sample size is so small you could do it by hand! – Daryn 28/9, 2011 at 17:18

Why do you think strstr should be slower than all the others? Do you know what algorithm strstr uses? I think it's quite likely that strstr uses a fine-tuned, processor-specific, assembly-coded algorithm of the KMP type or better. In which case you don't stand a chance of out-performing it in C for such small benchmarks.

(The reason I think this is likely is that programmers love to implement such things.)

Stoppage answered 28/9, 2011 at 17:20 Comment(21)

It’s almost guaranteed that strstr uses the naive algorithm. It performs well enough, and, this is the important part, it’s general purpose. KMP, for instance, sometimes performs much worse so it certainly isn’t used. – Tristan 28/9, 2011 at 17:22

Huh, always thought strstr used some sort of brute force. – Zanazander 28/9, 2011 at 17:33

@Zanazander It does, don’t be misled. – Tristan 28/9, 2011 at 17:34

IIRC Glibc has optimised asm functions for popular platforms and functions, and (more or less) naive fall-back implementations in C. – Micro 28/9, 2011 at 17:53

@Konrad: Look at Banthar's link. It supports my claim. And what's this about KMP sometimes performing much worse than the naive algorithm? I find that impossible to believe. Can you provide a reference? – Stoppage 28/9, 2011 at 17:59

@Josh: It doesn't, don't be misled! – Stoppage 28/9, 2011 at 18:0

TonyK: I suggest editing your answer to include @Banthar's link. Makes it obvious that you are right (well, it looks more like BM than KMP, but still). – Splint 28/9, 2011 at 20:55

@Stoppage KMP has good theoretical characteristics in terms of the number of comparisons but in practice it’s slower than the naive search since its operations are more complex. This is well known but if you need a reference, consider the standard work on this matter, Flexible Pattern Matching, page 1 by Navarro and Raffinot (the link is buggy, sorry – try pressing “Previous” and “Next” in the search results to get it to display correctly). Incidentally, this also shows that BM is in practice slower than Horspool. – Tristan 29/9, 2011 at 6:44

@Banthar Colour me surprised. I had looked at multiple strstr implementations (I’ve worked extensively on string search implementations over the last few years) and every implementation I saw was the naive search – which is a good choice, by the way (see above). I’m also surprised that they used BM instead of Horspool for long needles since the latter is known to perform better on average in typical situations (again, see above). – Tristan 29/9, 2011 at 6:51

TonyK, I agree with your answer, but for it to be more futureproof, it should not mention what kind of technique strstr() is implemented. Just something "likely to be very fast on the platform of choice". – Bollen 29/9, 2011 at 10:52

@Amigable: OK, I changed it slightly. – Stoppage 29/9, 2011 at 11:8

And that tipped the scale. +1 I like this answer very much. :) – Bollen 29/9, 2011 at 11:10

@Konrad: When you said "much worse", I thought you meant "worse than O(mn)". I agree that naive search is likely to be faster than KMP for small strings. But then a good implementation will use a naive algorithm for such strings anyway. – Stoppage 29/9, 2011 at 11:11

@Stoppage It will be faster for small needles, regardless of the length of the haystack – this is important, because it’s the most common use-case by far: your text may be very large, but as long as the needle is relatively small, naive search will be faster than KMP. – Tristan 29/9, 2011 at 11:22

@Konrad: You're still wrong, it's not BM -- as it says on line 1 it uses the Two-Way Algorithm, which amazingly takes only linear time and constant space, unlike BM which requires space proportional to the needle size, or BMH which is not linear-time on some inputs: igm.univ-mlv.fr/~lecroq/string/node26.html – Mayce 29/9, 2011 at 11:46

@Konrad: When line 50 mentions the "bad character shift table", that should tell you that the sentence really means it's even more like BMH than BM (since both use that table, but BMH uses only that table). – Mayce 29/9, 2011 at 11:52

@Mayce Ah, so it does. I had only skimmed the code itself and surmised that two-way would only be used for short needles. This is indeed a sensible choice although I still think Horspool could be faster in many realistic scenarios (and Navarro & Raffinot also seem to think this). But you are wrong about the constant space: two-way does need space linear in the size of the needle, otherwise its worst case is quadratic, not linear! And the glibc implementation also uses linear space for large needles. – Tristan 29/9, 2011 at 11:56

@Konrad: Where do you get the idea that it needs quadratic time from? Also, the glibc implementation needs O(alphabet_size) space for large needles, not space linear in the size of the needle. – Mayce 29/9, 2011 at 12:9

@Mayce I’ll shut up now. – Tristan 29/9, 2011 at 12:20

@Konrad: Ah, I see the algorithm description was worded a bit confusingly on the page I linked. But it is indeed a weakness-free algorithm: constant-time and linear-space. That it exists should be better known and appreciated! :) What's more, it's right where it should be: in the library implementation of strstr(). :) – Mayce 29/9, 2011 at 12:32

Whoops, I meant "linear-time and constant-space"! – Mayce 3/10, 2011 at 5:30

Horspool, KMP et al are optimal at minimizing the number of byte-comparisons.

However, that's not the bottleneck on a modern processor. On an x86/64 processor, your string is being loaded into L1 cache in cache-line-width chunks (typically 64 bytes). No matter how clever your algorithm is, unless it gives you strides that are larger than that, you gain nothing; and the more complicated Horspool code is (at least one table lookup) can't compete.

Furthermore, you're stuck with the "C" string constraint of null-termination: SOMEWHERE the code has to examine every byte.

strstr() is expected to be optimal for a wide range of cases; e.g. searching for tiny strings like "\r\n" in a short string, as well as much longer ones where some smarter algorithm might have a hope. The basic strchr/memcmp loop is pretty hard to beat across the whole range of likely inputs.

Pretty much all x86-compatible processors since 2003 have supported SSE2. If you disassembled strlen()/x86 for glibc, you may have noticed that it uses some SSE2 PCMPEQ and MOVMASK operations to search for the null terminator 16 bytes at a time. The solution is so efficient that it beats the obvious super-simple loop, for anything longer than the empty string.

I took that idea and came up with a strstr() that beats glibc's strstr() for all cases greater than 1 byte --- where the relative difference is pretty much moot. If you're interested, check out:

Convergence SSE2 and strstr()
A better strstr() with no ASM code

If you care to see a non-SSE2 solution that dominates strstr() for target strings of more than 15 bytes, check out:

which makes use of multibyte comparisons rather than strchr(), to find a point at which to do a memcmp.

BTW you've probably figured out by now that the x86 REP SCASB/REP CMPSB ops fall on their ass for anything longer than 32 bytes, and are not much improvement for shorter strings. Wish Intel had devoted a little more attention to that, than to adding SSE4.2 "string" ops.

For strings large enough to matter, my perf tests show that BNDM is better than Horspool, across the board. BNDM is more tolerant of "pathological" cases, such as targets that heavily repeat the last byte of a pattern. BNDM can also make use of SSE2 (128bit registers) in a way that competes with 32bit registers in efficiency and start-up cost. Source code here.

Debris answered 22/10, 2011 at 6:21 Comment(4)

Have you tried submitting a patch to the glibc? Of course, this requires careful benchmark over a wide range of different inputs but your above explanation seems sound. – Tristan 29/10, 2011 at 10:30

so given the cache optimization, is it a waste of time to try to outperform the naive search with Horspool? – Stonefish 5/3, 2013 at 3:40

@Mischa, Any advise how to compile this code for windows? I have no idea what is ffs and where to get it. – Gerladina 5/4, 2013 at 12:1

Wow, so sorry I never answered that. For the record, the MSVC equivalent is _BitScanForward(), but it's a bit clunky. I wrap it as (pardon the code in comments please, O Editor) static inline uint32_t intlowz(uint32_t x) { uint32_t pos; return x ? (_BitScanForward(&pos, x), pos) : 32; } – Debris 27/5, 2017 at 23:0

Without seeing your code, it's hard to say exactly. strstr is heavily optimized, and usually written in assembly language. It does things like reading data 4 bytes at a time and comparing them (bit-twiddling if necessary if the alignment isn't right) to minimize memory latency. It can also take advantage of things like SSE to load 16 bytes at a time. If your code is only loading one byte at a time, it's probably getting killed by memory latency.

Use your debugger and step through the disassembly of strstr -- you'll probably find some interesting things in there.

Sidonius answered 28/9, 2011 at 17:21 Comment(3)

The effects of 4-byte-comparisons and SSE on string search are quite limited, unfortunately since there exist linear dependencies between the elements. In particular, you can’t just compare in non-overlapping 4-byte blocks: even if you compare 4 bytes at a time, you still need to compare overlapping blocks, not gaining much. – Tristan 29/9, 2011 at 6:56

@Konrad: strstr can basically be implemented as a loop around strchr and strcmp/memcmp (with memcmp requiring an initial strlen). strchr can be implemented with some bit-twiddling tricks and doing 4-byte loads. memcmp can load 4 bytes at a time from each of the two comparands and compare those words. If the two operands don't have the same alignment, you keep track of two words at a time and bit-twiddle them to get the misaligned word to compare against the other string. See sysdeps/x86_64/memcmp.S in glibc, for example. – Sidonius 29/9, 2011 at 17:35

@Konrad: you gain something comparing the first 4 bytes of a pattern against a 4byte sliding window, shifting and inserting the next target string byte into the bottom of a 32bit word. It's not a much as you'd wish, since mixing 8 and 32 bit instructions on the same register stalls the instruction pipeline. SSE's a different story: you can compare 16 bytes from the target string with the first byte of the pattern, replicated 16 times. This seriously rocks. See my post below re an article with more details. – Debris 28/10, 2011 at 18:48

Imagine you want to get something cleaned. You could just clean it yourself, or you could hire ten professional cleaners to clean it. If the cleaning job is an office building, the latter solution would be preferable. If the cleaning job was one window, the former would be preferable.

You never get any payback for the time spent setting up to do the job efficiently because the job doesn't take very long.

Selfservice answered 28/9, 2011 at 17:19 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags