I am currently matching HTML using this code:
preg_match('/<\/?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;/u', $html, $match, PREG_OFFSET_CAPTURE, $position)
It matches everything perfectly; however, if the string contains a multibyte character, that character is counted as 2 when the position is returned.
For example, the returned $match array would look something like:
array
  0 =>
    array
      0 => string '<br />' (length=6)
      1 => int 132
  1 =>
    array
      0 => string 'br' (length=2)
      1 => int 133
The real position of the <br /> match is 128, but the text before it contains 4 multibyte characters, so the reported position is 132. I really thought adding the /u modifier would make preg_match aware of this, but no luck there.
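To make the discrepancy concrete, here is a minimal sketch of one possible way to turn the reported position into a character position. It assumes the input is valid UTF-8 and that the mbstring extension is available; the helper name byteOffsetToCharOffset and the sample string are made up for illustration and are not part of the original code.

```php
<?php
// Hypothetical helper (not from the original code): converts a byte offset
// reported by preg_match() into a character offset.
// Assumes $subject is valid UTF-8 and the mbstring extension is loaded.
function byteOffsetToCharOffset(string $subject, int $byteOffset): int
{
    // substr() slices by bytes, so this is exactly the text before the match;
    // mb_strlen() then counts that prefix in characters rather than bytes.
    return mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');
}

// Sample input: four 2-byte characters (ä ö ü é) followed by a tag.
$html = "\u{e4}\u{f6}\u{fc}\u{e9}<br />";

preg_match('/<\/?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;/u', $html, $match, PREG_OFFSET_CAPTURE, 0);

$byteOffset = $match[0][1];                                 // 8 (counted in bytes)
$charOffset = byteOffsetToCharOffset($html, $byteOffset);   // 4 (counted in characters)

echo $byteOffset, ' ', $charOffset, "\n";                   // prints "8 4"
```

The difference between the two numbers here is 4, the same gap as between 132 and 128 above. Note that the offset argument passed to preg_match (the $position in the original call) is also counted in bytes, so a character position would need to be converted back before being reused there.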