Reason for the slowness
Taking a look at the PHP 5.5.6 source files, the delay seems to arise for the most part in mbfilter.c, where, as hakre surmised, both haystack and needle need to be validated and converted every time mb_strpos (or, I guess, most of the mb_* family) gets called:
Unless haystack is in the default format, encode it to the default format:
if (haystack->no_encoding != mbfl_no_encoding_utf8) {
    mbfl_string_init(&_haystack_u8);
    haystack_u8 = mbfl_convert_encoding(haystack, &_haystack_u8, mbfl_no_encoding_utf8);
    if (haystack_u8 == NULL) {
        result = -4;
        goto out;
    }
} else {
    haystack_u8 = haystack;
}
Unless needle is in the default format, encode it to the default format:
if (needle->no_encoding != mbfl_no_encoding_utf8) {
    mbfl_string_init(&_needle_u8);
    needle_u8 = mbfl_convert_encoding(needle, &_needle_u8, mbfl_no_encoding_utf8);
    if (needle_u8 == NULL) {
        result = -4;
        goto out;
    }
} else {
    needle_u8 = needle;
}
According to a quick check with valgrind, the encoding conversion accounts for a huge part of mb_strpos's runtime, about 84% of the total, or five-sixths:
218,552,085 ext/mbstring/libmbfl/mbfl/mbfilter.c:mbfl_strpos [/usr/src/php-5.5.6/sapi/cli/php]
183,812,085 ext/mbstring/libmbfl/mbfl/mbfilter.c:mbfl_convert_encoding [/usr/src/php-5.5.6/sapi/cli/php]
which appears to be consistent with the OP's timings of mb_strpos versus strpos.
Encoding aside, mb_strpos'ing a string is exactly the same as strpos'ing a slightly longer string. Okay, a string up to four times as long if you have really awkward strings, but even then you would get a slowdown by a factor of four, not by a factor of twenty. The additional 5-6x slowdown comes from the encoding conversions.
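To make the "slightly longer string" point concrete, here is a minimal sketch. The sample string is made up for illustration, and it assumes the mbstring extension is loaded. The same text is searched once by bytes and once by characters; the only difference is that multibyte characters make the byte string a little longer.

```php
<?php
// "héllo wörld", written with explicit UTF-8 byte escapes so the example
// does not depend on the encoding this file is saved in.
$haystack = "h\xC3\xA9llo w\xC3\xB6rld";
$needle   = "w\xC3\xB6rld";

// strpos counts bytes: "é" occupies two bytes, so the match starts at byte 7.
$byteOffset = strpos($haystack, $needle);

// mb_strpos counts characters: the same match starts at character 6.
$charOffset = mb_strpos($haystack, $needle, 0, 'UTF-8');

printf("byte offset: %d, character offset: %d\n", $byteOffset, $charOffset);
// prints "byte offset: 7, character offset: 6"
```

The haystack is 11 characters but 13 bytes long, which is the kind of modest growth the paragraph above is talking about.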
Accelerating mb_strpos...
So what can you do? You can skip those two steps by making sure that, internally, your strings are already in the "basic" format in which the mbfl* functions do their conversion and comparison, which is mbfl_no_encoding_utf8 (UTF-8):
- Keep your data in UTF-8.
- Convert user input to UTF-8 as soon as practical.
- Convert back to the client encoding only if needed.
Then your pseudo-code:
$haystack = "...";
$needle = "...";
$res = mb_strpos($haystack, $needle, 0, $Encoding);
becomes:
$haystack = "...";
$needle = "...";
mb_internal_encoding('UTF-8') or die("Cannot set encoding");
$haystack = mb_convert_encoding($haystack, 'UTF-8' [, $SourceEncoding]);
$needle = mb_convert_encoding($needle, 'UTF-8' [, $SourceEncoding]);
$res = mb_strpos($haystack, $needle, 0);
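For what it's worth, here is how that pseudo-code might look as something you can actually run. This is only a sketch: to_utf8() is a hypothetical helper of my own, and the ISO-8859-1 sample input is an assumption standing in for whatever encoding your clients really send.

```php
<?php
mb_internal_encoding('UTF-8') or die("Cannot set encoding");

// Hypothetical helper: normalise a string to UTF-8 once, at the boundary.
function to_utf8($text, $sourceEncoding)
{
    // Nothing to do if the input is already declared as UTF-8.
    if (strcasecmp($sourceEncoding, 'UTF-8') === 0) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', $sourceEncoding);
}

// "café au lait" as it might arrive from an ISO-8859-1 client ("é" is byte 0xE9).
$sourceEncoding = 'ISO-8859-1';
$haystack = to_utf8("caf\xE9 au lait", $sourceEncoding);
$needle   = to_utf8("\xE9 au", $sourceEncoding);

// Both strings are UTF-8 now, so no encoding argument is needed.
$res = mb_strpos($haystack, $needle, 0);
var_dump($res); // int(3)
```

The conversion happens once per string instead of once per mb_* call, which is the whole point of the exercise.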
...when it's worth it
Of course this only pays off if the "setup time" and maintenance cost of keeping everything in UTF-8 is appreciably smaller than the "run time" cost of doing the conversions implicitly in every mb_* function.
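If you want to check whether it is worth it for your own workload, a rough timing sketch along these lines may help. The iteration count and the test strings are arbitrary assumptions, and the numbers will depend on your PHP version; the analysis above is for 5.5.6, and other releases may behave quite differently.

```php
<?php
$iterations     = 20000;          // arbitrary workload size
$sourceEncoding = 'ISO-8859-1';   // assumed client encoding

// 500 single-byte characters followed by the needle (0xE9 is "é" in ISO-8859-1).
$needle   = "needle\xE9";
$haystack = str_repeat("abcdefgh\xE9 ", 50) . $needle;

// Variant 1: leave the data as it is and let every call convert implicitly.
$t0 = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    $implicit = mb_strpos($haystack, $needle, 0, $sourceEncoding);
}
$implicitTime = microtime(true) - $t0;

// Variant 2: convert once up front (counted in the timing), then search in UTF-8.
$t0 = microtime(true);
$haystackU8 = mb_convert_encoding($haystack, 'UTF-8', $sourceEncoding);
$needleU8   = mb_convert_encoding($needle, 'UTF-8', $sourceEncoding);
for ($i = 0; $i < $iterations; $i++) {
    $upfront = mb_strpos($haystackU8, $needleU8, 0, 'UTF-8');
}
$upfrontTime = microtime(true) - $t0;

printf("implicit conversion: %.4fs\n", $implicitTime);
printf("convert once:        %.4fs\n", $upfrontTime);
```

Both variants return the same character offset, so any difference you see is purely the cost of where the conversion happens.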
[...] /u flag, so likely just does a binary comparison. – Lisettelisha
[...] preg_match (without the u modifier) works then plain old strpos must also work (and obviously will be faster). Please clarify. – Lupita
[...] strpos uses the ‘dumbest’ while PCRE may use a smarter one. – Acroter
[...] preg_match has exactly the same pitfalls as strpos. It makes absolutely no sense to consider preg_match a solution to the problem and bar strpos from also being one. Also, look at this: ideone.com/ULItqd – Lupita
[...] mb_strpos does that you cannot get elsewhere is tell you the character offset of the substring regardless of what the input encoding is. – Lupita
[...] preg_match is considered a solution here, and it doesn't tell you the character offset either. – Lupita
[...] mb_strpos() on the other hand always has a lot more work to do because it always pays attention to character sets (meaning it has to find the correct tables etc) and it actually checks character boundaries. PCRE, without unicode input and /u, is just doing byte comparisons, optimised to essentially the same logic as strpos(). – Duramen
[...] /literal/ with no complicated regexp things (repetition, char classes, etc etc) it will be optimised to be essentially the same as calling strpos() with a little bit of overhead (actually, the regex compile step is pretty expensive, but since it's cached that's not really relevant in a 10^6 loop...) – Duramen