Python string 'in' operator implementation algorithm and time complexity

Asked 9/8, 2013 at 3:30 Answered 20/4 at 21:31

I am thinking of how the in operator implement, for instance

>>> s1 = 'abcdef'
>>> s2 = 'bcd'
>>> s2 in s1
True

In CPython, which algorithm is used to implement the string match, and what is the time complexity? Is there any official document or wiki about this?

Aile answered 9/8, 2013 at 3:30 Comment(1)

Can anyone tell me why KMP wasn't used for this implementation? – Lundin 17/9, 2019 at 18:29

It's a combination of Boyer-Moore and Horspool.

You can view the C code here:

Fast search/count implementation, based on a mix between Boyer-Moore and Horspool, with a few more bells and whistles on the top. For some more background, see: https://web.archive.org/web/20201107074620/http://effbot.org/zone/stringlib.htm.

From the link above:

When designing the new algorithm, I used the following constraints:

should be faster than the current brute-force algorithm for all test cases (based on real-life code), including Jim Hugunin’s worst-case test

small setup overhead; no dynamic allocation in the fast path (O(m) for speed, O(1) for storage)

sublinear search behaviour in good cases (O(n/m))

no worse than the current algorithm in worst case (O(nm))

should work well for both 8-bit strings and 16-bit or 32-bit Unicode strings (no O(σ) dependencies)

many real-life searches should be good, very few should be worst case

reasonably simple implementation

Texture answered 9/8, 2013 at 3:34 Comment(5)

Thanks for the quick reply! Based on this article, effbot.org/zone/stringlib.htm, the time complexity is sublinear , that is better than KMP algorithm. – Aile 9/8, 2013 at 3:49

@Aile In the best cases it can be sublinear. – Texture 9/8, 2013 at 3:49

@arshajiii yes, that's what I want. Thanks again! – Aile 9/8, 2013 at 3:52

@arshajiii One more question, do you know when the good cases happen? I cannot figure out that. thx – Aile 9/8, 2013 at 14:56

@Aile probably when there aren't frequent 'false partial matches' e.g. 'bcdefg' and searching for 'fg' rather than looking for 'aab' in 'aaaacaaab' – Stalder 18/7, 2016 at 2:7

Since 2021, CPython uses Crochemore and Perrin's Two-Way algorithm for larger n.

From the source code:

If the strings are long enough, use Crochemore and Perrin's Two-Way algorithm, which has worst-case O(n) runtime and best-case O(n/k). Also compute a table of shifts to achieve O(n/k) in more cases, and often (data dependent) deduce larger shifts than pure C&P can deduce. See stringlib_find_two_way_notes.txt in this folder for a detailed explanation.

See https://github.com/python/cpython/pull/22904

Unregenerate answered 20/4 at 21:31 Comment(0)

Recommended topics

Hot tags