Optimized version of strstr (search has constant length)
My C program makes a lot of strstr function calls. The standard library strstr is already fast, but in my case the search string always has a length of 5 characters. I replaced it with a special version to gain some speed:

/* Returns 1 if the 5-character string ct occurs in cs, and 0 otherwise.
   Reading cs[4] in the loop condition assumes strlen(cs) >= 4; the
   summary below guarantees at least 10. */
int strstr5(const char *cs, const char *ct)
{
    while (cs[4]) {

        if (cs[0] == ct[0] && cs[1] == ct[1] && cs[2] == ct[2] && cs[3] == ct[3] && cs[4] == ct[4])
            return 1;

        cs++;
    }

    return 0;
}

The function returns an integer because it's enough to know whether ct occurs in cs. My function is simple and faster than the standard strstr in this special case, but I'm interested to hear whether anybody has performance improvements that could be applied. Even small improvements are welcome.

Summary:

  • cs has a length of at least 10, but otherwise it can vary. The length is known beforehand (though not used in my function) and is usually 100 to 200 characters.
  • ct has a length of exactly 5.
  • The content of the strings can be anything.

Edit: Thank you for all answers and comments. I have to study and test ideas to see what works best. I will start with MAK's idea about suffix trie.

Tallyho answered 27/6, 2010 at 19:10 Comment(6)
Will you call the function frequently with the same value of cs? Of ct? – Manmade
The value of cs is frequently the same. ct changes every time. – Tallyho
You can't validly name your function strstr5(); the implementation reserves all function names that start with "str" followed by a lowercase letter. – Bashibazouk
How often will you call this function with identical cs? – Malek
Thanks for the comment about the function name. I did not think about it. – Tallyho
cs stays the same until all subsequences of the other string are done. ct is a 5-character substring of a string with similar properties to cs. So cs is compared with all possible 5-character subsequences of a similar string. – Tallyho
There are several fast string search algorithms. Try looking at Boyer-Moore (as already suggested by Greg Hewgill), Rabin-Karp and KMP algorithms.

If you need to search for many small patterns in the same large body of text, you can also try implementing a suffix tree or a suffix array. But these are IMHO somewhat harder to understand and implement correctly.

But beware: these techniques only give you an appreciable speedup when the strings involved are very large. You might not see any improvement for strings less than, say, 1000 characters long.

Edit:

If you are searching the same text over and over again (i.e. the value of cs is always/often the same across calls), you will get a big speedup by using a suffix trie (basically a trie of suffixes). Since your text is as small as 100 or 200 characters, you can use the simple O(n^2) method to build the trie and then do multiple fast searches on it. Each search requires only 5 comparisons instead of the usual 5*200.
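
For illustration, here is a minimal sketch of that approach in C. The names are made up for this example, allocations are unchecked, and the 256-pointer nodes waste a lot of memory, which is tolerable only because the text is short:

#include <stdlib.h>

typedef struct TrieNode {
    struct TrieNode *child[256];   /* one edge per possible byte value */
} TrieNode;

/* Insert every suffix of text: O(n^2) nodes and time for an n-char text. */
TrieNode *suffix_trie_build(const char *text)
{
    TrieNode *root = calloc(1, sizeof *root);
    for (size_t i = 0; text[i]; i++) {
        TrieNode *cur = root;
        for (size_t j = i; text[j]; j++) {
            unsigned char c = (unsigned char)text[j];
            if (!cur->child[c])
                cur->child[c] = calloc(1, sizeof *cur);
            cur = cur->child[c];
        }
    }
    return root;
}

/* One node hop per pattern character: 5 hops for a 5-character ct. */
int suffix_trie_contains(const TrieNode *root, const char *pattern)
{
    for (; *pattern; pattern++) {
        root = root->child[(unsigned char)*pattern];
        if (!root)
            return 0;
    }
    return 1;
}

Every substring of the text is a path starting at the root, so a lookup touches no more nodes than the pattern has characters. Freeing the trie is left out of the sketch.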

Edit 2:

As mentioned in caf's comment, C's strstr algorithm is implementation-dependent. glibc uses a linear-time algorithm, which should be more or less as fast in practice as any of the methods I've mentioned. While the OP's method is asymptotically slower (O(n*m) instead of O(n)), it is probably faster because both n and m (the lengths of the text and the pattern) are very small, and it does not need any of the long preprocessing of the glibc version.

Dogma answered 27/6, 2010 at 19:34 Comment(4)
Thank you for your answer. In my case cs is relatively short. I updated my question again; looks like I forgot to mention important points in it. Looks like I might stick with simple code, as joe snyder also pointed out. – Tallyho
The C standard does not specify which algorithm should be used for strstr() - it only specifies the functionality. glibc at least uses the linear-complexity Two-Way algorithm: sourceware.org/git/?p=glibc.git;a=blob;f=string/… – Pigeonhole
@caf: Thanks for pointing that out. I didn't know glibc used an O(n) algorithm. – Dogma
For what it's worth, glibc's O(n) algorithm is slower than the naive O(nm) algorithm for needle lengths up to about 40 bytes. Even more ridiculous, they special-case needles shorter than 32 bytes, but with a bad variant of Two-Way that's slower in all cases. Performance would be at least 2-3 times better in all real-world uses (and never worse than it is now) if they replaced the short-needle case with the naive algorithm. – Thirzi
Reducing the number of comparisons will increase the speed of the search. Keep a running 32-bit integer holding the current four characters of the string and compare it to a fixed integer built from the first four characters of the search term. If the integers match, compare the fifth character.

#include <stdint.h>

/* Returns the offset of the first match of the 5-character ct in cs,
   or -1 if there is none. Assumes strlen(cs) >= 5. */
int search5(const char *cs, const char *ct)
{
  /* Work on unsigned bytes so the shifts below are well defined. */
  const unsigned char *s = (const unsigned char *)cs;
  const unsigned char *t = (const unsigned char *)ct;
  uint32_t term = (uint32_t)t[0] << 24 | t[1] << 16 | t[2] << 8 | t[3];
  uint32_t walk = (uint32_t)s[0] << 24 | s[1] << 16 | s[2] << 8 | s[3];
  int i = 0;

  do {
    if ( term == walk && t[4] == s[4] ) { return i; } // or return a pointer or 1
    walk = ( walk << 8 ) | s[4];
    s += 1;
    i += 1;
  } while ( s[4] ); // assumes original cs was longer than ct

  return -1; // failure
}
Add checks for a short cs.

Edit:

Added fixes from comments. Thanks.

This could easily be adapted to use 64-bit values. You could store cs[4] and ct[4] in local variables instead of assuming the compiler will do that for you. You could add 4 to cs and ct before the loop and use cs[0] and ct[0] in the loop.
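
For example, here is a minimal sketch of the 64-bit adaptation (strstr5_u64 is a hypothetical name; like the code above it assumes cs holds at least 5 characters). With 64 bits, all five pattern bytes fit in the low 40 bits of one integer, so each window needs a single comparison:

#include <stdint.h>

int strstr5_u64(const char *cs, const char *ct)
{
  const unsigned char *s = (const unsigned char *)cs;
  const unsigned char *t = (const unsigned char *)ct;
  const uint64_t mask = 0xffffffffffULL; /* low 40 bits = 5 bytes */
  uint64_t term = 0, walk = 0;
  int i;

  for (i = 0; i < 5; i++) {   /* pack ct[0..4] and cs[0..4] */
    term = term << 8 | t[i];
    walk = walk << 8 | s[i];
  }
  for (i = 0; ; i++) {
    if (walk == term) { return i; }
    if (!s[i + 5]) { return -1; }  /* terminator reached: no match */
    walk = (walk << 8 | s[i + 5]) & mask;
  }
}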

Abrupt answered 27/6, 2010 at 19:46 Comment(5)
+1, this is basically the same idea as Rabin-Karp. The variable walk acts as a rolling hash. – Dogma
<< 0 isn't needed. i isn't needed; return 1 on a match or return 0 if you finish the loop. You can also test cs[4] at the end of the loop instead of the start, in case the first iteration succeeds, since the cs minimum length is guaranteed. – Subtitle
This will only be faster in certain conditions, i.e. probably only when there are many similar strings where only the character at index 4 doesn't match. Most of the time his original, if it doesn't match, just moves on, whereas this always does a computation as well as a comparison (assuming we get rid of the i). – Garlan
You must use int32_t to be portable. – Felt
You mean uint32_t. The result of a left shift on a signed int is undefined when it overflows. – Thirzi
strstr's interface imposes some constraints that can be beaten. It takes null-terminated strings, and any competitor that first does a strlen of its target will lose. It takes no "state" argument, so set-up costs can't be amortized across many calls with (say) the same target or pattern. It is expected to work on a wide range of inputs, including very short targets/patterns and pathological data (consider searching for "ABABAC" in a string of "ABABABABAB...C"). libc is also now platform-dependent. In the x86-64 world, SSE2 is seven years old, and libc's strlen and strchr using SSE2 are 6-8 times faster than naive algorithms. On Intel platforms that support SSE4.2, strstr uses the PCMPESTRI instruction. But you can beat that, too.

Boyer-Moore (and Turbo B-M, Backward Oracle Matching, et al.) have set-up times that pretty much knock them out of the running, not even counting the null-terminated-string problem. Horspool is a restricted B-M that works well in practice but doesn't handle the edge cases well. The best I've found in that field is BNDM ("Backward Nondeterministic Directed-Acyclic-Word-Graph Matching"), whose implementation is smaller than its name :-)

Here are a couple of code snippets that might be of interest. Intelligent SSE2 beats naive SSE4.2, and handles the null-termination problem. A BNDM implementation shows one way of keeping set-up costs. If you're familiar with Horspool, you'll notice the similarity, except that BNDM uses bitmasks instead of skip-offsets. I'm about to post how to solve the null-terminator problem (efficiently) for suffix algorithms like Horspool and BNDM.
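
The original snippet links are not preserved here, but to give the flavor, below is a minimal BNDM sketch (bndm_search is a hypothetical name; it takes explicit lengths for patterns up to 32 bytes rather than solving the null-terminator problem discussed above):

#include <stdint.h>
#include <stddef.h>

/* Scan each window right to left, tracking in D one bit per pattern
   position for every factor of the pattern that still matches. */
int bndm_search(const unsigned char *text, size_t n,
                const unsigned char *pat, size_t m)
{
    uint32_t B[256] = {0};
    size_t i, pos;

    if (m == 0 || m > 32 || n < m)
        return -1;

    for (i = 0; i < m; i++)            /* bit (m-1-i) of B[c]: pat[i] == c */
        B[pat[i]] |= 1u << (m - 1 - i);

    for (pos = 0; pos <= n - m; ) {
        size_t j = m, last = m;
        uint32_t D = ~0u;              /* every factor still alive */
        while (j > 0 && D != 0) {
            D &= B[text[pos + j - 1]]; /* consume one window byte */
            j--;
            if (D & (1u << (m - 1))) { /* bytes read form a pattern prefix */
                if (j > 0)
                    last = j;          /* remember the safe shift */
                else
                    return (int)pos;   /* all m bytes matched */
            }
            D <<= 1;
        }
        pos += last;
    }
    return -1;
}

Where Horspool shifts using a skip table indexed by one character, here the shift comes from the longest pattern prefix seen in the window, tracked via the bitmasks.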

A common attribute of all good solutions is splitting into different algorithms for different argument lengths. An example is Sanmayce's "Railgun" function.

Treehopper answered 5/2, 2012 at 18:6 Comment(0)
Your code may access cs beyond the bounds of its allocation if cs is shorter than 4 characters.

A common optimisation for string search is to use the Boyer-Moore algorithm, where you start looking in cs from the end of what would be ct. See the linked page for a full description of the algorithm.
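
For a sense of what that looks like in code, here is a minimal sketch of the Horspool simplification of Boyer-Moore (bmh_search is a hypothetical name; full Boyer-Moore adds a second, good-suffix shift table):

#include <stddef.h>
#include <string.h>

/* Compare each window, then shift by the distance from the text byte
   under the needle's last position to that byte's last use in the needle. */
const char *bmh_search(const char *haystack, const char *needle)
{
    size_t n = strlen(haystack), m = strlen(needle);
    size_t skip[256], i;

    if (m == 0 || n < m)
        return m == 0 ? haystack : NULL;

    for (i = 0; i < 256; i++)
        skip[i] = m;                   /* byte absent from needle: skip m */
    for (i = 0; i + 1 < m; i++)
        skip[(unsigned char)needle[i]] = m - 1 - i;

    for (i = 0; i + m <= n; i += skip[(unsigned char)haystack[i + m - 1]]) {
        if (memcmp(haystack + i, needle, m) == 0)
            return haystack + i;
    }
    return NULL;
}

As the comments below point out, with the needle fixed at 5 characters the maximum shift is only 5, so the table setup can outweigh the benefit on short, changing inputs.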

Crossbench answered 27/6, 2010 at 19:14 Comment(2)
Keep in mind that the setup for Boyer-Moore might be expensive if cs is short and always changing. In that case, simpler code like yours might be faster (although it could still use some tweaking, such as moving the if...return to after cs++, since it can't be true the first time because the minimum length of cs is 10). Be sure to benchmark to measure what is really the fastest solution for your actual inputs. – Subtitle
Boyer-Moore will almost certainly be slower. Since the needle size is fixed at 5, both are O(n), and Boyer-Moore only really helps when the needle is long. – Thirzi
You won't beat a good implementation on a modern x86 computer.

New Intel processors have an instruction that takes 16 bytes of the string you are examining and up to 16 bytes of the search string, and in a single instruction returns the first byte position where the search string could start (or that there is none). For example, if you search for "Hello" in the string "abcdefghijklmnHexyz", the first instruction will tell you that the string "Hello" might start at offset 14 (because, reading 16 bytes, the processor sees the bytes H, e at the end of the block, which might be the beginning of "Hello"). The next instruction, starting at offset 14, then tells you that the string isn't there. And yes, it knows about trailing zero bytes.

That's two instructions to find that a five-character string is not present in a 19-character string. Try beating that with any special-case code. (Obviously this is built specifically for strstr, strcmp and similar functions.)
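
As a rough illustration, here is one such step via the SSE4.2 intrinsics (pcmpestri_step is a hypothetical helper, compiled with -msse4.2; a full strstr must also loop over 16-byte blocks and re-check partial matches at block ends, as described above):

#include <nmmintrin.h>  /* SSE4.2 intrinsics */
#include <string.h>

/* Returns the lowest offset within one 16-byte block of the haystack
   where the needle matches or could continue past the end of the block,
   or 16 if there is no candidate at all. */
int pcmpestri_step(const char *block, int block_len,
                   const char *needle, int needle_len)
{
    char hbuf[16] = {0}, nbuf[16] = {0};   /* padded copies: safe 16-byte loads */
    memcpy(hbuf, block, (size_t)(block_len < 16 ? block_len : 16));
    memcpy(nbuf, needle, (size_t)(needle_len < 16 ? needle_len : 16));

    __m128i hay = _mm_loadu_si128((const __m128i *)hbuf);
    __m128i ndl = _mm_loadu_si128((const __m128i *)nbuf);

    /* _SIDD_CMP_EQUAL_ORDERED selects PCMPESTRI's substring-search mode. */
    return _mm_cmpestri(ndl, needle_len, hay, block_len,
                        _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ORDERED);
}

On the example above, the first step reports the candidate at offset 14, and a second step run on the tail rejects it.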

Samara answered 2/12, 2015 at 17:25 Comment(0)
