When should I use mb_strpos() over strpos()?

Asked 19/4, 2011 at 6:3 Answered 19/4, 2011 at 6:8

Huh, looking at all those string functions, sometimes I get confused. Some people use the mb_ functions all the time, others use the plain ones, so the question is simple...

When should I use mb_strpos() and when should I go with the plain one (strpos())?

And yes, I'm aware that mb_ stands for multi-byte, but does that really mean that if I'm working only with UTF-8 encoded strings, I should stick with the mb_ functions?

Thanks in advance!

Sotos answered 19/4, 2011 at 6:3 Comment(0)

You should use the mb_ functions whenever you expect to work with text that's not pure ASCII. I.e. you can work with the regular string functions, even if you're using UTF-8, as long as all the strings you're using them on only contain ASCII characters.

strpos('foobar', 'foo')  // fine in any (ASCII-compatible) encoding, including UTF-8
strpos('ふーばー', 'ふー') // won't work as expected, use mb_strpos instead
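To make the difference concrete, here is a minimal sketch. It assumes PHP 7 or later (for the `\u{...}` escapes) and that the mbstring extension is loaded. The variable names are only for illustration. The point is that strpos() reports a byte offset while mb_strpos() reports a character offset, so the two only agree on ASCII-only text.

```php
<?php
// "ふーばー" written with escapes so the example does not depend on the file's encoding.
// Each of these four characters takes 3 bytes in UTF-8 (12 bytes total).
$haystack = "\u{3075}\u{30FC}\u{3070}\u{30FC}"; // ふーばー
$needle   = "\u{3070}\u{30FC}";                 // ばー

$bytePos = strpos($haystack, $needle);                // offset counted in bytes
$charPos = mb_strpos($haystack, $needle, 0, 'UTF-8'); // offset counted in characters

var_dump($bytePos); // int(6)
var_dump($charPos); // int(2)

// ASCII-only text: 1 character == 1 byte, so both functions agree.
$asciiBytePos = strpos('foobar', 'bar');
$asciiCharPos = mb_strpos('foobar', 'bar', 0, 'UTF-8');

var_dump($asciiBytePos); // int(3)
var_dump($asciiCharPos); // int(3)
```

If the offset is later passed to substr() the byte offset is the right one, and if it is passed to mb_substr() the character offset is; mixing the two families is where things go wrong.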

Pool answered 19/4, 2011 at 6:8 Comment(3)

And what about performance when using mb_ functions on UTF-8 strings containing only ASCII characters? Would it be slower, faster, or no significant difference (as always)? I'm interested because mb_ looks like the de facto choice when working with UTF-8 strings, even if they contain only ASCII. – Sotos 19/4, 2011 at 6:16

@Tom As always: profile it, if you're concerned. :) My hunch is that it would be slightly slower, but not in any significant way, unless you're working on massively large strings. – Pool 19/4, 2011 at 6:19

There's no point in profiling if the result is milliseconds either way (and let's not argue on this one; of course in long-running scripts it could become noticeable) and/or you can more or less predict the result logically. But I already thought it'd be slower, I just wanted a different opinion. Anyway, thanks, accepted! – Sotos 19/4, 2011 at 6:26

Yes, if working with UTF-8 (which is a multi-byte encoding: one character can use more than one byte), you should use the mb_* functions.

The non-mb functions work on bytes, not characters, which is fine when 1 character == 1 byte; but that's not the case with (for example) UTF-8.
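A small sketch of what "bytes, not characters" means in practice, assuming PHP 7 or later and a loaded mbstring extension. The variable names are made up for the example. It compares lengths and shows that cutting by bytes can split a character in half.

```php
<?php
// "naïve": the ï (U+00EF) takes 2 bytes in UTF-8, every other letter takes 1.
$word = "na\u{00EF}ve";

$bytes = strlen($word);             // counts bytes
$chars = mb_strlen($word, 'UTF-8'); // counts characters

var_dump($bytes); // int(6)
var_dump($chars); // int(5)

// Taking "the first 3" means different things to each family.
$byteCut = substr($word, 0, 3);             // "na" plus only the first byte of ï
$charCut = mb_substr($word, 0, 3, 'UTF-8'); // "naï"

// The byte-based cut is no longer valid UTF-8; the character-based cut is.
var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($charCut, 'UTF-8')); // bool(true)
```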

Impostor answered 19/4, 2011 at 6:6 Comment(3)

Should be: "...that's not necessarily the case with UTF-8." The lower UTF-8 ranges are identical to ASCII and work just fine with the regular string functions. – Pool 19/4, 2011 at 6:9

Chose deceze's answer because of ASCII only statement. – Sotos 19/4, 2011 at 6:28

@Pool Just curious, are there any pitfalls to using mb_* all the time rather than the regular string functions? Micro-optimization aside. – Marengo 19/4, 2011 at 6:32

I'd say yes. Here's the description from the PHP documentation:

mbstring provides multibyte specific string functions that help you deal with multibyte encodings in PHP. In addition to that, mbstring handles character encoding conversion between the possible encoding pairs. mbstring is designed to handle Unicode-based encodings such as UTF-8 and UCS-2 and many single-byte encodings for convenience....

If you're not sure that the mbstring extension is loaded, you should check first, because mbstring is a non-default extension.
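One way such a check could look, as a rough sketch: `find_position()` is a hypothetical helper name invented for this example, not something PHP or mbstring provides, and the choice of UTF-8 is an assumption about your data. extension_loaded() is the standard way to test for an extension at runtime.

```php
<?php
/**
 * Hypothetical helper (not part of PHP): finds $needle in $haystack,
 * preferring mb_strpos() when the mbstring extension is available.
 *
 * @return int|false Position of the first match, or false if not found.
 */
function find_position(string $haystack, string $needle)
{
    if (extension_loaded('mbstring')) {
        // Character offset, assuming the strings are UTF-8.
        return mb_strpos($haystack, $needle, 0, 'UTF-8');
    }

    // Fallback: byte offset. Matches the mb_ result only for ASCII-only text.
    return strpos($haystack, $needle);
}

var_dump(find_position('foobar', 'bar')); // int(3)
var_dump(find_position('foobar', 'xyz')); // bool(false)
```

Keep in mind the fallback is not equivalent for non-ASCII input, so depending on the application it may be better to fail loudly when mbstring is missing than to silently return byte offsets.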

Alcoholometer answered 19/4, 2011 at 6:7 Comment(0)
