C++ string::find complexity
44

Why doesn't C++'s implementation of string::find() use the KMP algorithm (so that it runs in O(N + M)) instead of running in O(N * M)? Is that corrected in C++0x? If the complexity of the current find is not O(N * M), what is it?

So what algorithm is implemented in gcc? Is it KMP? If not, why not? I've tested it, and the running time suggests that it runs in O(N * M).

Ambert answered 15/1, 2012 at 12:27 Comment(3)
I am the victim of string::find, see my post: #19507071Discounter
how did you test it? I test it using gcc4.90. It seems the latest version of string.find is O(N+M)Kursh
@Kursh in g++ 9.2.1, (std::string(n-1,'a')+"b"+std::string(n,'a')).find(std::string(n,'a')) is O(n^2) or worse. E.g. for n = 1 million, it takes 21 seconds on my computer, but should be instant (a few milliseconds, perhaps).Incarnate
45

Why the c++'s implemented string::substr() doesn't use the KMP algorithm (and doesn't run in O(N + M)) and runs in O(N * M)?

I assume you mean find(), rather than substr(), which doesn't need to search and should run in linear time (and even that only because it has to copy the result into a new string).

The C++ standard doesn't specify implementation details, and only specifies complexity requirements in some cases. The only complexity requirements on std::string operations are that size(), max_size(), operator[], swap(), c_str() and data() are all constant time. The complexity of anything else depends on the choices made by whoever implemented the library you're using.

The most likely reason for choosing a simple search over something like KMP is to avoid needing extra storage. Unless the string to be found is very long, and the string to search contains a lot of partial matches, the time taken to allocate and free that storage would likely be much more than any savings from the lower-complexity algorithm.

Is that corrected in c++0x?

No, C++11 doesn't add any complexity requirements to std::string, and certainly doesn't add any mandatory implementation details.

If the complexity of current substr is not O(N * M), what is that?

That's the worst-case complexity, when the string to search contains a lot of long partial matches. If the characters have a reasonably uniform distribution, then the average complexity would be closer to O(N). So by choosing an algorithm with better worst-case complexity, you may well make more typical cases much slower.

Locomotive answered 15/1, 2012 at 12:42 Comment(7)
I know, but why don't they implement the faster algorithm (KMP)?Ambert
@Farzam: (a) it's harder to implement correctly; (b) it requires memory allocation, and only has lower worst-case complexity, so in practice will probably be slower for most common use cases.Locomotive
but that's faster for large Ns. They could use the O(N * M) search for small Ns and KMP for large Ns. KMP also requires O(N) memory, so the overall memory order doesn't change.Ambert
@Farzam: KMP is faster for search strings that contain many partial matches, and slower for search strings that don't, regardless of size of M and N. Reliably estimating which algorithm would be faster would be just as complex as performing the search itself, and I don't think that complicating the algorithm with an unreliable estimate would benefit anyone. But of course I haven't studied this as much as those who implemented the libraries, so I can't really comment on exactly why at least some of them chose a simple implementation, beyond the observations I've already made.Locomotive
@Ambert Having implemented all of the STL algorithms, I'd think the basic reasons why it isn't generally done are: 1. memory allocation, even if it is just in the order of the inputs, is something you just don't want to do if you can help it in a general algorithm; 2. there doesn't seem to be a lot of interest in using std::search() and time is better invested improving other algorithms (I feel as if I sound like PJ!); 3. for the expected use cases the O(n * m) algorithm seems to have better, or at least acceptable, performance. Admittedly, I haven't implemented KMP to test it, though.Damal
I have just implemented a first version of KMP search and found yet another complication when using it for std::search(): while std::search() is supported on forward iterators, KMP search uses arrays quite heavily, and I don't yet see how to avoid advancing iterators (well, the version I currently have doesn't even try: it requires random access iterators), although I think the resulting complexity should still be linear (I haven't entirely convinced myself about this, however).Damal
"So by choosing an algorithm with better worst-case complexity, you may well make more typical cases much slower." There are several problems with this argument. (1) "typical" isn't a robust concept (2) suboptimality of naive algorithm in worst case is much, much worse than suboptimality of smart algorithm in any case (3) there's no need to make these cases enemies of each other: just do the naive algorithm, counting ops; if count gets too big, switch to smart algorithm.Incarnate
18

FYI, the string::find in both gcc/libstdc++ and llvm/libcxx was very slow. I improved both of them quite significantly (by ~20x in some cases). You might want to check the new implementations:

GCC: PR66414 optimize std::string::find https://github.com/gcc-mirror/gcc/commit/fc7ebc4b8d9ad7e2891b7f72152e8a2b7543cd65

LLVM: https://reviews.llvm.org/D27068

The new algorithm is simpler and uses the hand-optimized assembly implementations of memchr and memcmp.

It is incorrect to assume that the KMP and Boyer-Moore algorithms are always faster just because their time complexity is optimal according to algorithmic analysis. The latest versions of string::find are much faster in cases where the 'needle' produces few partial matches. The KMP algorithm optimizes the 'skipping' of redundant checks, so unless the text contains many redundant checks to skip, it can't beat the existing implementations.

Another thing to note is that memchr is a sub-linear algorithm in the sense that it can find a matching character in O(N/k), where k is the 'vector factor' of the processor. Modern processors are also great at prefetching simple memory access patterns, which helps memchr. The KMP algorithm can offer no such advantage.

Honorable answered 10/1, 2017 at 18:4 Comment(0)
10

Where do you get the impression that std::string::substr() doesn't use a linear algorithm? In fact, I can't even imagine how to implement it in a way which has the complexity you quoted. Also, there isn't much of an algorithm involved: is it possible that you think this function does something other than what it does? std::string::substr() just creates a new string starting at its first argument, using either the number of characters specified by the second parameter or the characters up to the end of the string.

You may be referring to std::string::find(), which doesn't have any complexity requirements, or std::search(), which is indeed allowed to do O(n * m) comparisons. However, this gives implementers the freedom to choose between an algorithm which has the best theoretical complexity and one which doesn't need additional memory. Since allocating arbitrary amounts of memory is generally undesirable unless specifically requested, this seems a reasonable thing to do.

Heptagon answered 15/1, 2012 at 12:32 Comment(0)
2

Let's look into the CLRS book. On page 989 of the third edition we have the following exercise:

Suppose that pattern P and text T are randomly chosen strings of length m and n, respectively, from the d-ary alphabet Σ_d = {0, 1, ..., d-1}, where d >= 2. Show that the expected number of character-to-character comparisons made by the implicit loop in line 4 of the naive algorithm is

    (n - m + 1) · (1 - d^(-m)) / (1 - d^(-1)) <= 2(n - m + 1)

over all executions of this loop. (Assume that the naive algorithm stops comparing characters for a given shift once it finds a mismatch or matches the entire pattern.) Thus, for randomly chosen strings, the naive algorithm is quite efficient.

NAIVE-STRING-MATCHER(T, P)
1  n = T.length
2  m = P.length
3  for s = 0 to n - m
4      if P[1..m] == T[s+1..s+m]
5          print "Pattern occurs with shift" s

Proof:

For a single shift we expect to perform 1 + 1/d + ... + 1/d^{m-1} comparisons. Now use the geometric-sum formula and multiply by the number of valid shifts, which is n - m + 1. □

Fiddling answered 23/3, 2017 at 4:44 Comment(1)
I like CLRS. :-)Suberin
1

The C++ standard does not dictate the performance characteristics of substr (or many other parts, including the find you're most likely referring to with an M*N complexity).

It mostly dictates functional aspects of the language (with some exceptions like the non-legacy sort functions, for example).

Implementations are even free to implement qsort as a bubble sort (but only if they want to be ridiculed and possibly go out of business).

For example, there are only seven (very small) sub-points in section 21.4.7.2 basic_string::find of C++11, and none of them specify performance parameters.

Atmospherics answered 15/1, 2012 at 12:35 Comment(2)
I'm not sure about qsort() but std::sort() is required to have complexity O(n * ln(n)) in C++2011. In C++2003 it was allowed to have a complexity O(n * n) to have quicksort be a viable choice for an algorithm but it was shown that you can get similar normal case performance with introsort as you would get for quicksort while having a worst case complexity of O(n * ln(n)). An implementer is free to use a different algorithm with the same worst case complexity.Damal
qsort doesn't have requirements like that, it's purely functional, same as in C. More importantly, while some parts of C++ do have requirements like that, substr and find are not part of that subset. But you raised a good point on the non-legacy sort stuff, so I added that to the answer.Atmospherics
1

Where do you get your information about the C++ library? If you do mean string::find() and it really doesn't use the KMP algorithm, then I suggest that it is because that algorithm isn't generally faster than a simple linear search, due to having to build a partial-match table before the search can proceed.

Alexio answered 15/1, 2012 at 12:51 Comment(2)
but building the match table can also be done in linear time (for example, if N, M = 10^6: N * M = 10^12, but KMP will do about 10 * 10^6 operations, which is about 10^5 times faster than the other one).Ambert
@Farzam: So how often does the N=M=10^6 case occur in practice? I would think the typical case may be more like N=100, M=5. Also, most search strings will probably fail the comparison after 1 or 2 chars, so the performance will be more like O(N). So for typical cases the overhead of more complex methods may be substantial.Nativity
1

If you are going to search for the same pattern in multiple texts, the Boyer-Moore algorithm is a good choice, because the pattern tables need only be computed once but are used every time a text is searched. If you are only going to search for a pattern once in one text, though, the overhead of computing the tables, along with the overhead of allocating memory, slows you down too much, and std::string::find() will beat you, since it does not allocate any memory and has no setup overhead. Boost has multiple string searching algorithms, including Boyer-Moore. I found that Boyer-Moore was an order of magnitude slower than std::string::find() in the case of a single pattern search in one text. For my cases, Boyer-Moore was rarely faster than std::string::find() even when searching multiple texts with the same pattern.

Darter answered 7/10, 2019 at 17:9 Comment(1)
exactly the reason why I implemented it that way in C++ standard libraries.Honorable

© 2022 - 2024 — McMap. All rights reserved.