Why is mb_strpos() so considerably slower than strpos()?

Asked 21/6, 2014 at 18:16 Answered 7/9, 2017 at 17:57

I had criticized an answer that suggested preg_match over === when finding substring offsets in order to avoid type mismatch.

However, later on the answer's author has discovered that preg_match is actually significantly faster than multi-byte operating mb_strpos. Normal strpos is faster than both functions but of course, cannot deal with multibyte strings.

I understand that mb_strpos needs to do something more than strpos. However, if regex can do it almost as fast as strpos, what is it that mb_strpos does that takes so much time?

I have strong suspicion that it's an optimization error. Could, for example, PHP extensions be slower than its native functions?

mb_strpos($str, "颜色", 0 ,"GBK"): 15.988190889 (89%)
preg_match("/颜色/", $str): 1.022506952 (6%)
strpos($str, "dh"): 0.934401989 (5%)

Functions were run 10⁶ times. The absolute time(s) accounts for the sum of time of 10⁶ runs of a function, rather than average for one.

The test string is $str = "代码dhgd颜色代码";. The test can be seen here (scroll down to skip the testing class).

Note: According to one of the commentators (and common sense), preg_match also does not use multi-byte when comparing, being subject to same risk of errors as strpos.

Rustice answered 21/6, 2014 at 18:16 Comment(14)

Your regex does not use the Unicode /u flag, so likely just does a binary comparison. – Lisettelisha 21/6, 2014 at 18:24

I don't quite get this. If dumb preg_match (without the u modifier) works then plain old strpos must also work (and obviously will be faster). Please clarify. – Lupita 21/6, 2014 at 18:24

@Lupita They may use different search algorithms. I think strpos uses the ‘dumbest’ while PCRE may use a smarter one. – Acroter 21/6, 2014 at 18:28

@Lupita - strops() can find multibyte sequences, but it may also return false positives for multibyte sequences because it doesn't recognise character boundaries for multibyte characters, it only works on byte sequences – Cordeiro 21/6, 2014 at 18:29

Note that performance of preg_match will depend on the underlying version of pcre – Cordeiro 21/6, 2014 at 18:30

@MarkBaker: I 'm aware of that, but the OP evidently is not. And as you undoubtedly know, in the most common case where we 're talking UTF-8 the possibility for false positives does not exist. – Lupita 21/6, 2014 at 18:30

@TomášZato , can you repeat the test and report times using a much longer haystack with the same needle in the middle? That way it would be more apparent whether the problem is mostly in the setup code (I'm seeing a lot of initialization in "ext/mbstring/libmbfl/mbfl/mbfilter.c" that looks expensive, I'll valgrind it later to see whether it really is - or if it's the search loop. – Lysis 21/6, 2014 at 18:31

@TomášZato: preg_match has exactly the same pitfalls as strpos. It makes absolutely no sense to consider preg_match a solution to the problem and bar strpos from also being one. Also, look at this: ideone.com/ULItqd – Lupita 21/6, 2014 at 18:32

Also, simply using strpos() to test if a particular sequence exists in a multibyte string is one thing, but using it to get the character offset is another matter entirely. strops() will return the byte offset, mb_strops() will return the character offset – Cordeiro 21/6, 2014 at 18:35

One thing that mb_strpos does that you cannot get elsewhere is tell you the character offset of the substring regardless of what the input encoding is. – Lupita 21/6, 2014 at 18:35

@MarkBaker: Sure (we were typing concurrently it seems). But preg_match is considered a solution here, and it doesn't tell you the character offset either. – Lupita 21/6, 2014 at 18:36

@Lupita Using the PREG_OFFSET_CAPTURE flag would give you the offset, but I'm not sure how much of an overhead that would add – Cordeiro 21/6, 2014 at 18:37

@TomášZato It's also worth noting that multiple runs with PCRE will give you a result that is not entirely representative of the reality. PHP caches compiled regexes, meaning that on each iteration after the first there is substantially less work for PCRE to do. mb_strpos() on the other hand always has a lot more work to do because it always pays attention to character sets (meaning it has to find the correct tables etc) and it actually checks character boundaries. PCRE, without unicode input and /u, is just doing byte comparisons, optimised to essentially the same logic as strpos(). – Duramen 21/6, 2014 at 18:54

@JakeGould well it's not even checking unicode. If the haystack is utf8 and the needle is utf16, it won't find it, because it's just doing a byte-by-byte comparison. It's also not the case the PCRE is "simple", but an expr that is just /literal/ with no complicated regexp things (repetition, char classes, etc etc) it will be optimised to be essentially the same as calling strpos() with a little bit of overhead (actually, the regex compile step is pretty expensive, but since it's cached that's not really relevant in a 10^6 loop...) – Duramen 21/6, 2014 at 19:19

To understand why the functions have a different runtime you need to understand what they actually do. Because summing them up as ‘they search for needle in haystack’ isn’t enough.

`strpos`

If you look at the implementation of strpos, it uses zend_memstr internally, which implements a pretty naive algorithm for searching for needle in haystack: Basically, it uses memchr to find the first byte of needle in haystack and then uses memcmp to check whether the whole needle begins at that position. If not, it repeats the search for the first byte of needle from the position of the previous match of the first byte.

Knowing this, we can say that strpos does only search for a byte sequence in a byte sequence using a naive search algorithm.

`mb_strpos`

This function is the multi-byte counterpart to strpos. This makes searching a little more complex as you can’t just look at the bytes without knowing to which character they belong to.

mb_strpos uses mbfl_strpos, which does a lot more in comparison to the simple algorithm of zend_memstr, it’s like 200 lines of complex code (mbfl_strpos) compared to 30 lines of slick code (zend_memstr).

We can skip the part where both needle and haystack are converted to UTF-8 if necessary, and come to the major chunk of code.

First we have two setup loops and then there is the loop that proceeds the pointer according to the given offset where you can see that they aware of actual characters and how they skip whole encoded UTF-8 characters: since UTF-8 is a variable-width character encoding where the first byte of each encoded character denotes the whole length of the encoded character. This information is stored in the u8_tbl array.

Finally, the loop where the actual search happens. And here we have something interesting, because the test for needle at a certain position in haystack is tried in reverse. And if one byte did not match, the jump table jtbl is used to find the next possible position for needle in haystack. This is actually an implementation of the Boyer–Moore string search algorithm.

So now we know that mb_strpos …

converts the strings to UTF-8, if necessary
is aware of actual characters
uses the Boyer–Moore search algorithm

`preg_match`

As for preg_match, it uses the PCRE library. Its standard matching algorithm uses a nondeterministic finite automaton (NFA) to find a match conducting a depth-first search of the pattern tree. This is basically a naive search approach.

Acroter answered 21/6, 2014 at 22:8 Comment(3)

Very nice write-up. One thing I wondered about (and didn't put in my simple answer) is if mb does also unicode normalization when doing an unicode search. But I imagined that would probably too much for that lib so I skipped it. - regarding the search algorithm, does it mean that whith a much larger haystack, the relative times would shift, towards mb_strpos, right? – Trull 21/6, 2014 at 22:35

@hakre: I seem to remember that the BM algorithm works better with longer needles, made up of relatively few characters in respect to the haystack. Even better if the haystack contains a lot of characters that don't occur in the needle. Then BM can run onwards in leaps and bounds, way faster than naive byte compare. – Lysis 22/6, 2014 at 12:57

Terrific explanation... I'm glad I'm not the one having to program these functions! :) +1 – Revocable 26/6, 2014 at 4:33

I am leaving out preg_match to make the analysis more punctuated.

Taken your observation that mb_strpos is relatively slower compared to strpos, it leads you to the assumption that — because of the consumed time — mb_strpos does more than strpos.

I think this observation is correct.

You then asked what is that "more" that is causing the time difference.

I try to give a simple answer: That "more" is because strpos operates on binary strings (one character = 8 bit = 1 octet = 1 byte). mb_strpos operates on encoded character sequences (as nearly all of the mb_* functions do) which can be X bits, perhaps even in variable length per each character.

As this is always about a specific character encoding, both the haystack as well as the needle string (probably) need to be first validated for that encoding, and then the whole operation to find the string position needs to be done in that specific character encoding.

That is translation work and — depending on encoding — also requires a specific search algorithm.

Next to that the mb extension also needs to have some structures in memory to organize the different character encodings, be it translation tables and/or specific algorithms. See the extra parameter you inject — the name of the encoding for example.

That is by far more work than just doing simple byte-by-byte comparisons.

For example the GBK character encoding is pretty interesting when you need to encode or decode a certain character. The mb string function in this case needs to take all these specifics into account to find out if and at which position the character is. As PHP only has binary strings in the userland from which you would call that function, the whole work needs to be done on each single function call.

To illustrate this even more, if you look through the list of supported encodings (mb_list_encodings), you can also find some like BASE64, UUENCODE, HTML-ENTITIES and Quoted-Printable. As you might imagine, all these are handled differently.

For example a single numeric HTML entity can be up to 1024 bytes large, if not even larger. An extreme example I know and love is this one. However, for that encoding, it has to be handled by the mb_strpos algorithm.

Trull answered 21/6, 2014 at 19:26 Comment(4)

OK, it does more, we knew that. But why is PCRE apparently so much faster for the apparently same task? – Vizor 21/6, 2014 at 19:42

@deceze: PCRE in PHP knows only two encodings: the one of strpos (where it is slower) and UTF-8. For UTF-8 a lot of optimized code exists. That is far beyond what the mb extension offers as encodings, the linked manual page counts 60 different. – Trull 21/6, 2014 at 19:45

@deceze: Also both are not doing the same even when they both use UTF-8 encoding: mb_strpos returns the numeric character position, preg_match returns the byte-offset (if you enable offset returning). That also shows that internally mb_strpos has to do things differently, e.g. managing character positions next to byte offsets. I don't find it that hard to imagine that these things might take some times - and also the quality of the library might not be as precise as pcre which is pretty widely used. – Trull 21/6, 2014 at 19:57

@AmalMurali: Thanks for the touch-up and comment :) – Trull 21/6, 2014 at 22:30

Reason of slowness

Taking a look at the 5.5.6 PHP source files, the delay seems to arise for the most part in the mbfilter.c, where - as hakre surmised - both haystack and needle need to be validated and converted, every time mb_strpos (or, I guess, most of the mb_* family) gets called:

Unless haystack is in the default format, encode it to the default format:

if (haystack->no_encoding != mbfl_no_encoding_utf8) {
        mbfl_string_init(&_haystack_u8);
        haystack_u8 = mbfl_convert_encoding(haystack, &_haystack_u8, mbfl_no_encoding_utf8);
        if (haystack_u8 == NULL) {
                result = -4;
                goto out;
        }
} else {
        haystack_u8 = haystack;
}

Unless needle is in the default format, encode it to the default format:

if (needle->no_encoding != mbfl_no_encoding_utf8) {
        mbfl_string_init(&_needle_u8);
        needle_u8 = mbfl_convert_encoding(needle, &_needle_u8, mbfl_no_encoding_utf8);
        if (needle_u8 == NULL) {
                result = -4;
                goto out;
        }
} else {
        needle_u8 = needle;
}

According to a quick check with valgrind, the encoding conversion accounts for a huge part of mb_strpos's runtime, about 84% of the total, or five-sixths:

218,552,085  ext/mbstring/libmbfl/mbfl/mbfilter.c:mbfl_strpos [/usr/src/php-5.5.6/sapi/cli/php]
183,812,085  ext/mbstring/libmbfl/mbfl/mbfilter.c:mbfl_convert_encoding [/usr/src/php-5.5.6/sapi/cli/php]

which appears to be consistent with the OP's timings of mb_strpos versus strpos.

Encoding not considered, mb_strpos'ing a string is exactly the same of strpos'ing a slightly longer string. Okay, a string up to four times as long if you have really awkward strings, but even then, you would get a delay by a factor of four, not by a factor of twenty. The additional 5-6X slowdown arises from encoding times.

Accelerating `mb_strpos`...

So what can you do? You can skip those two steps by ensuring that you have internally the strings already in the "basic" format in which mbfl* do conversion and compare, which is mbfl_no_encoding_utf8 (UTF-8):

Keep your data in UTF-8.
Convert user input to UTF-8 as soon as practical.
Convert, if necessary, back to client encoding if needed.

Then your pseudo-code:

$haystack = "...";
$needle   = "...";

$res = mb_strpos($haystack, $needle, 0, $Encoding);

becomes:

$haystack = "...";
$needle   = "...";

mb_internal_encoding('UTF-8') or die("Cannot set encoding");
$haystack   = mb_convert_encoding($haystack, 'UTF-8' [, $SourceEncoding]);
$needle     = mb_convert_encoding($needle, 'UTF-8', [, $SourceEncoding]);

$res = mb_strpos($haystack, $needle, 0);

...when it's worth it

Of course this is only convenient if the "setup time" and maintenance of a whole UTF-8 base is appreciably smaller than the "run time" of doing conversions implicitly in every mb_* function.

Lysis answered 21/6, 2014 at 21:20 Comment(0)

-1

The problems with mb_ performance may be caused by a messed php-mbstring package installation (on a linux). Installing it explicitly for the exact version of php installation helped me.

sudo apt-get install php7.1-mbstring

...

Before: Time: 16.17 seconds, Memory: 36.00MB OK (3093 tests, 40272 assertions)
After:  Time:  1.81 seconds, Memory: 36.00MB OK (3093 tests, 40272 assertions)

Drill answered 7/9, 2017 at 17:57 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

strpos

mb_strpos

preg_match