PHP: strpos & substr with UTF-8

Say I have a long UTF-8 encoded string.

And say I want to detect if $var exists in this string.

Assuming $var is always going to be plain ASCII letters or digits (e.g. "hello123"), I shouldn't need to use mb_strpos or iconv_strpos, right? It shouldn't matter if the position is not character-wise correct, as long as it's consistent with the other functions.

Example:

$var='hello123';
$pos=strpos($utf8string,$var);
if ($pos!==false) $uptohere=substr($utf8string,0,$pos);

Am I correct that the above code will extract everything up to 'hello123' regardless of whether the string contains fancy UTF-8 characters? My logic is that because both strpos and substr will be consistent with each other (even if this is consistently wrong) then it should still work.

Dobbs answered 24/2, 2013 at 10:13 Comment(2)
Surely it doesn't matter because substr will think exactly the same and so it will crop the string still at the correct point? - Dobbs
ASCII being UTF8 compatible, I do believe it should work. But why don't you just try with a few Chinese / Japanese strings? - Insensible

Yes, you are correct. There's no ambiguity about the characters themselves, i.e. hello123 can't possibly be anything else in UTF-8. The way you're slicing it, it doesn't matter whether you're slicing by character or by byte number.

So yes, this is safe, as long as your string is UTF-8 and thereby ASCII compatible.

See here for quick test: http://3v4l.org/XnM8s

Why this works:

The string "漢字hello123" in UTF-8 looks like this as bytes (I hope this aligns correctly):

e6 | bc | a2 | e5 | ad | 97 | 68 | 65 | 6c | 6c | 6f | 31 | 32 | 33
     漢      |      字      | h  | e  | l  | l  | o  | 1  | 2  | 3

strpos will look for the byte sequence 68656c6c6f313233, returning 6 as the starting byte of "hello123". substr will slice 6 bytes from byte 0, returning "漢字". There is no ambiguity. You're finding and slicing by bytes, it doesn't matter how many characters there are.
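The byte walk-through above can be reproduced with a minimal sketch like the following. The variable names follow the question; the haystack is written with byte escapes only so that the result does not depend on how the source file is saved.

```php
<?php
// Sample haystack: "漢字hello123" spelled out as raw UTF-8 bytes.
$utf8string = "\xE6\xBC\xA2\xE5\xAD\x97hello123";
$var = 'hello123';

// Hex dump of the raw bytes: three bytes per CJK character, one per ASCII character.
$hex = bin2hex($utf8string);
echo $hex, "\n";       // e6bca2e5ad9768656c6c6f313233

// strpos() and substr() both count bytes, so the offset from one is valid for the other.
$pos = strpos($utf8string, $var);
$uptohere = ($pos !== false) ? substr($utf8string, 0, $pos) : '';

echo $pos, "\n";       // 6 (byte offset of the needle)
echo $uptohere, "\n";  // 漢字
```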

You need to either work entirely in characters, in which case the string functions must be encoding-aware, or work entirely in bytes, in which case the only requirement is that the bytes aren't ambiguous. (As a made-up example, ambiguity would arise if the bytes of "hello123" happened to also encode "中国" in BIG5; they don't, this is just an illustration.) UTF-8 is self-synchronizing, meaning there's no such ambiguity: the bytes of an ASCII character never occur inside a multi-byte sequence.
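As a rough sketch of the two consistent approaches, the following compares a byte-based pair with a character-based pair on the same sample string. It assumes the mbstring extension is loaded and that both haystack and needle are valid UTF-8; the variable names are illustrative only. The last part tries a multi-byte needle to illustrate the self-synchronizing property.

```php
<?php
// Sample data: "漢字hello123" as raw UTF-8 bytes (illustrative).
$haystack = "\xE6\xBC\xA2\xE5\xAD\x97hello123";
$needle = 'hello123';

// Byte-based pair: offset and length are both counted in bytes.
$bytePos = strpos($haystack, $needle);
$bytePrefix = substr($haystack, 0, $bytePos);

// Character-based pair: offset and length are both counted in characters.
$charPos = mb_strpos($haystack, $needle, 0, 'UTF-8');
$charPrefix = mb_substr($haystack, 0, $charPos, 'UTF-8');

var_dump($bytePos, $charPos);           // int(6) and int(2): the offsets differ
var_dump($bytePrefix === $charPrefix);  // bool(true): the extracted prefix is identical

// Multi-byte needle "字hello123": a byte match in valid UTF-8 starts on a character boundary.
$mbNeedle = "\xE5\xAD\x97hello123";
$mbPos = strpos($haystack, $mbNeedle);
$mbPrefix = substr($haystack, 0, $mbPos);
var_dump(mb_check_encoding($mbPrefix, 'UTF-8')); // bool(true): the slice is still valid UTF-8
```

The offsets are only interchangeable within a pair; a byte offset and a character offset describe the same position in different units.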

Geriatric answered 24/2, 2013 at 10:24 Comment(5)
I always get confused which way the utf8 and ascii compatibility goes ;-) I figured that ascii is utf8 compatible. - Insensible
Thank you for your reply, it is very helpful. Still though, if the string was not UTF-8, say it was 10 chars, and strlen thought it was 20 chars, then strpos might find my $var at 12, then substr would still cut it at the correct place..? Please explain why this isn't the case if it isn't. - Dobbs
@Jack "Compatible" in the sense that an ASCII string (e.g. hello123) can't be mistaken for some other string. I'm just making this example up because I don't know of a working example off the top of my head, but hello123 in ASCII may match the string 中国 in BIG5. - Geriatric
@Dobbs Not sure what you're trying to ask, see update for more explanation though. :) - Geriatric
"You're finding and slicing by bytes, it doesn't matter how many characters there are." This means that the needle can also be utf-8? If I try to search "字hello123" I think we can apply the same reasoning that you did... there is a risk of corrupting the string? - Brickey

With UTF-8 you must use the mb_* functions; in your case you need to replace substr with

mb_substr($var, 0, N, 'UTF-8');

mb_substr()
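One caveat when switching to mb_substr: its length argument is a character count, so N has to come from a character-aware search such as mb_strpos, not from strpos. A minimal sketch, assuming the mbstring extension is loaded and using made-up sample data:

```php
<?php
// Sample data: "漢字hello123" as raw UTF-8 bytes (illustrative).
$utf8string = "\xE6\xBC\xA2\xE5\xAD\x97hello123";
$var = 'hello123';

// Consistent: character offset from mb_strpos() fed to mb_substr().
$n = mb_strpos($utf8string, $var, 0, 'UTF-8');
$ok = mb_substr($utf8string, 0, $n, 'UTF-8');

// Inconsistent: byte offset from strpos() treated as a character count.
$bytes = strpos($utf8string, $var);
$tooLong = mb_substr($utf8string, 0, $bytes, 'UTF-8');

echo $ok, "\n";       // 漢字
echo $tooLong, "\n";  // 漢字hell (6 characters instead of 2)
```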

Dongdonga answered 24/2, 2013 at 10:21 Comment(0)
