C# Regular Expressions with \Uxxxxxxxx characters in the pattern

Regex.IsMatch( "foo", "[\U00010000-\U0010FFFF]" ) 

Throws: System.ArgumentException: parsing "[-]" - [x-y] range in reverse order.

Looking at the hex values for \U00010000 and \U0010FFFF, I get 0xd800 0xdc00 for the first character and 0xdbff 0xdfff for the second.

So I guess I really have just one question: why are the Unicode characters formed with \U split into two chars in the string?

Spyglass answered 12/12, 2008 at 20:18 Comment(0)

They're surrogate pairs. Look at the values: they're over 65535. A char is only a 16-bit value. How would you express 65536 in only 16 bits?
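A minimal sketch of what this looks like in practice (the class and method names here are made up for illustration). It dumps the UTF-16 code units of the two endpoints from the question. It also suggests why the exception is raised: if the engine reads the character class one code unit at a time, it sees D800, then the range DC00-DBFF, then DFFF, and DC00 is greater than DBFF.

```csharp
using System;

public static class SurrogateDemo
{
    // Hypothetical helper: returns each UTF-16 code unit of s as 4 hex digits.
    public static string[] CodeUnits(string s)
    {
        var units = new string[s.Length];
        for (int i = 0; i < s.Length; i++)
            units[i] = ((int)s[i]).ToString("X4");
        return units;
    }

    public static void Main()
    {
        string first = "\U00010000";
        string last = "\U0010FFFF";

        Console.WriteLine(first.Length);                                // 2
        Console.WriteLine(string.Join(" ", CodeUnits(first)));          // D800 DC00
        Console.WriteLine(string.Join(" ", CodeUnits(last)));           // DBFF DFFF
        Console.WriteLine(char.IsSurrogatePair(first, 0));              // True
        Console.WriteLine(char.ConvertToUtf32(first, 0).ToString("X")); // 10000
    }
}
```

char.ConvertToUtf32 recombines the pair into the original code point, which is handy when you need to reason in code points rather than chars.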

Unfortunately it's not clear from the documentation how (or whether) the regular expression engine in .NET copes with characters which aren't in the basic multilingual plane. (The \uxxxx pattern in the regular expression documentation only covers 0-65535, just like \uxxxx as a C# escape sequence.)

Is your real regular expression bigger, or are you actually just trying to see if there are any non-BMP characters in there?

Shipmate answered 12/12, 2008 at 20:24 Comment(5)
Actually, you're right. From what I've found, \u only supports 4 hex digits (exactly 4, not more not less), \uFFFF is the maximum. I've deleted my "solution" because while it does not produce an error, it does not seem to be a valid unicode regex. I still believe that the \ needs to be escaped. - Gopher
Without the @ you would need to escape \ if \UFFFF were regex syntax (like \d for [0-9]), but instead it is string literal syntax (like \n for the new-line character). - Sherrilsherrill
This is unfortunate - many modern emoji fall into this category. - Ria
@damian: It's entirely possible that in the 7 years since this post, the regex engine has become rather better with respect to this. Note that the question I was answering was only directly about why \Uxxxxxxxx ends up as two chars... the "handling a regex" part is somewhat separate. You might want to ask a new question if you're facing this at the moment. - Shipmate
12y since the post, and this still isn't supported. - Citole

To work around this with the .NET regex engine, I use the following trick: "[\U00010000-\U0010FFFF]" is replaced with [\uD800-\uDBFF][\uDC00-\uDFFF]. The idea is that, since .NET regexes operate on UTF-16 code units rather than code points, we give the engine the surrogate ranges as ordinary characters. Narrower ranges can be expressed by handling the edges separately, e.g. [\U00011DEF-\U00013E07] is the same as (?:\uD807[\uDDEF-\uDFFF])|(?:[\uD808-\uD80E][\uDC00-\uDFFF])|(?:\uD80F[\uDC00-\uDE07])

It's harder to read and maintain, and it's not as flexible, but it still works as a workaround.
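A small sketch of the two patterns above in use (the class, constant, and method names are illustrative, not part of any library). The patterns are verbatim strings, so the \uXXXX escapes reach the regex engine untouched and are interpreted there as single code units.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NonBmpRegexDemo
{
    // Any high surrogate followed by any low surrogate: U+10000..U+10FFFF.
    public const string AnyNonBmp = @"[\uD800-\uDBFF][\uDC00-\uDFFF]";

    // The narrower range U+11DEF..U+13E07, split at the edges.
    public const string Narrow =
        @"(?:\uD807[\uDDEF-\uDFFF])|(?:[\uD808-\uD80E][\uDC00-\uDFFF])|(?:\uD80F[\uDC00-\uDE07])";

    public static bool ContainsNonBmp(string s) => Regex.IsMatch(s, AnyNonBmp);

    public static bool ContainsNarrow(string s) => Regex.IsMatch(s, Narrow);

    public static void Main()
    {
        Console.WriteLine(ContainsNonBmp("foo"));          // False
        Console.WriteLine(ContainsNonBmp("a\U0001F600b")); // True
        Console.WriteLine(ContainsNarrow("\U00012000"));   // True  (inside the range)
        Console.WriteLine(ContainsNarrow("\U00011DEE"));   // False (one below the range)
        Console.WriteLine(ContainsNarrow("\U00013E08"));   // False (one above the range)
    }
}
```

Note that these patterns assume well-formed UTF-16 input; a lone surrogate next to an unrelated one could still produce a match.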

Heeley answered 14/1, 2013 at 20:33 Comment(2)
How did you get [\uD800-\uDBFF][\uDC00-\uDFFF] from "[\U00010000-\U0010FFFF]"? - Gillian
@Gillian That's how non-BMP characters are encoded in UTF-16: one code unit (the high surrogate) in \uD800-\uDBFF and one (the low surrogate) in \uDC00-\uDFFF carry 10 bits each, resulting in a 20-bit space for non-BMP characters (this space is implied to start right after \uFFFF). So the rule for mapping one character is: subtract 0x10000, then add the upper 10 bits of the result to 0xD800 and the lower 10 bits to 0xDC00. Of course, ranges are trickier, but the idea is the same. See unicodebook.readthedocs.io/… - Heeley
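The mapping rule from the comment above can be sketched as follows (SurrogateMath and ToSurrogates are hypothetical names used only for this illustration). The result is cross-checked against the framework's own char.ConvertFromUtf32.

```csharp
using System;

public static class SurrogateMath
{
    // Splits a supplementary code point (U+10000..U+10FFFF) into its UTF-16 surrogate pair.
    public static (char High, char Low) ToSurrogates(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;              // 20-bit value
        char high = (char)(0xD800 + (offset >> 10));   // upper 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));  // lower 10 bits
        return (high, low);
    }

    public static void Main()
    {
        var (high, low) = ToSurrogates(0x11DEF);
        Console.WriteLine($"{(int)high:X4} {(int)low:X4}"); // D807 DDEF

        // Cross-check against the built-in conversion.
        string expected = char.ConvertFromUtf32(0x11DEF);
        Console.WriteLine(expected == new string(new[] { high, low })); // True
    }
}
```

Applying this to both ends of a code point range gives the edge values used when splitting a range into surrogate alternatives.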
