multi-byte characters in libc regcomp and regexec

Is there any way to get libc6's regex functions regcomp and regexec to work properly with multi-byte characters?

For instance, if my pattern is the UTF-8 string 猫机+猫, matching it against the UTF-8 encoded string 猫机机机猫 fails, where it should succeed.

I think this is because the byte representation of 机 is \xe6\x9c\xba, so the + applies only to the final byte \xba and matches one or more of that byte. I can make this instance work by putting parentheses around each multi-byte character in the pattern, but since this is for an application, I can't require users to do this.

Is there a way to flag a pattern or string to match as containing utf8 characters? Perhaps telling libc to store the pattern as wchar instead of char?

Academy answered 23/1, 2015 at 17:52 Comment(4)

Parens around the multi-byte char don't help? – Nonperishable 23/1, 2015 at 17:57

I can do that, but I am hoping for a solution that doesn't require the user to change the pattern in such a way. Thank you though! I edited the question to reflect your comment. – Academy 23/1, 2015 at 17:59

Why not just use codepoints \x{nnnnnnn}? That is, if the regex engine supports Unicode. Usually the regex and the target string should use the same encoding, but it's not a good idea to use literal Unicode chars within a regex string. If the engine supports it, though, it reads the input in character units, not byte units. – Viper 29/1, 2015 at 7:24

No, these options don't work because I'm hoping to use this within an application that shouldn't require users to alter their regexps. Does this mean there is no support for multi-byte chars in libc? Is there another extremely common C library I could use instead? – Academy 2/2, 2015 at 17:14

According to its manual page, glibc implements POSIX regular expressions, and POSIX regular expressions have no Unicode support per se. See this answer for an excerpt of the standard that sheds light on this point. This means you can forget about UTF-8 there as well: whatever locale environment you're in, multi-byte characters won't fit.

The post I've mentioned (as well as this one) suggests using a Unicode-aware regexp library, such as PCRE. If you're interested, PCRE provides a POSIX-style wrapper interface with the addition of a non-standard UTF flag (REG_UTF in PCRE2's pcre2posix.h; the older PCRE library's pcreposix.h calls it REG_UTF8). You won't have to rewrite your code, except for the #include directive and the addition of that flag at the regcomp step.

Hope this covers your needs.

Blamed answered 28/7, 2021 at 14:56 Comment(0)

Can you use a regex to build your regex? Here's a JavaScript example (though I know you aren't using JS):

function Examp() {
  // Wrap every character that directly precedes + or * in parentheses.
  // The u flag makes . match a whole code point rather than a UTF-16 code unit.
  var uString = "猫机+猫+猫ymg+sah猫";
  var plussed = uString.replace(/(.)(?=[+*])/gu, "($1)");
  console.log("Starting with string: " + uString + "\n" + "Result: " + plussed);
  // Same idea, but keep a backslash-escaped character together with its backslash.
  uString = "猫机+猫*猫ymg+s\\a+I+h猫";
  plussed = uString.replace(/(\\?.)(?=[+*])/gu, "($1)");
  console.log("You can even take this a step further and account for a character being escaped, if that's a consideration.");
  console.log("Starting with string: " + uString + "\n" + "Result: " + plussed);
}

<input type="button" value="Run" onclick="Examp()" />

Pereyra answered 21/2, 2015 at 9:2 Comment(0)

Is there a way to flag a pattern or string to match as containing utf8 characters?

I suspect that the LC_CTYPE environment variable (or another related locale setting) is the way to make regcomp/regexec understand your encoding.

At least the grep program seems to take it into account, as shown in https://mcmap.net/q/1473171/-what-does-constitute-one-character-for-regcomp-which-multibyte-encoding-does-determine-this; I haven't tested this with the regcomp function.

Rosary answered 26/11, 2016 at 23:35 Comment(0)
