Is there any way to get libc6's regexp functions regcomp and regexec to work properly with multi-byte characters?
For instance, if my pattern is the UTF-8 characters 猫机+猫, finding a match on the UTF-8 encoded string 猫机机机猫 will fail, where it should succeed.
I think this is because the character 机's byte representation is \xe6\x9c\xba, and the + is matching one or more of the byte \xba rather than one or more of the whole character. I can make this instance work by putting parentheses around each multi-byte character in the pattern, but since this is for an application, I can't require users to do this.
Is there a way to flag a pattern or string to match as containing UTF-8 characters? Perhaps telling libc to store the pattern as wchar instead of char?
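One thing that may be worth checking, offered as a sketch rather than a confirmed fix: the POSIX regex functions interpret the pattern and the subject string according to the current LC_CTYPE locale, and a C program starts in the "C" locale unless it calls setlocale. The helper below, match_in_locale, is a hypothetical name introduced only for illustration. It switches LC_CTYPE to a named locale, compiles the pattern with REG_EXTENDED (so that + is an operator), runs the match, and restores the previous locale. The locale name "C.UTF-8" used in the note after the code is an assumption; which UTF-8 locales are installed varies by system (`locale -a` lists them).

```c
#include <locale.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not part of libc.
 * Returns 1 if `text` matches `pattern` while LC_CTYPE is set to `locale`,
 * 0 if it does not match, and -1 if the locale is unavailable or the
 * pattern fails to compile.
 */
int match_in_locale(const char *locale, const char *pattern, const char *text)
{
    /* Copy the current LC_CTYPE name so it can be restored afterwards;
       the pointer returned by setlocale may be overwritten by later calls. */
    const char *current = setlocale(LC_CTYPE, NULL);
    char *saved = NULL;
    if (current != NULL) {
        saved = malloc(strlen(current) + 1);
        if (saved != NULL)
            strcpy(saved, current);
    }

    int result = -1;

    /* regcomp and regexec both run under the requested locale. */
    if (setlocale(LC_CTYPE, locale) != NULL) {
        regex_t re;
        if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) == 0) {
            result = (regexec(&re, text, 0, NULL, 0) == 0) ? 1 : 0;
            regfree(&re);
        }
    }

    /* Put the caller's locale back. */
    if (saved != NULL) {
        setlocale(LC_CTYPE, saved);
        free(saved);
    }
    return result;
}
```

Calling match_in_locale("C", "猫机+猫", "猫机机机猫") exercises the byte-oriented behaviour described above, while match_in_locale("C.UTF-8", "猫机+猫", "猫机机机猫") exercises the same match under a UTF-8 locale, assuming one with that name is installed. In a real application, a single setlocale(LC_ALL, "") at startup, with the user's environment set to a UTF-8 locale, would be the more usual arrangement than switching locales per call.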
\x{nnnnnnn}? That is, if the regex engine supports Unicode. Usually the regex and the target string should use the same encoding, but it's not a good idea to use literal Unicode chars within a regex string. If the engine does support Unicode, though, it reads the input in character units, not byte units. – Viper