Escaping unicode characters with C/C++
I need to escape Unicode characters within an input string to either UTF-16 or UTF-32 escape sequences. For example, the input string literal "Eat, drink, 愛" should be escaped as "Eat, drink, \u611b". Here are the rules in a table of sorts:

Escape                               | Unicode code point
-------------------------------------|-------------------
'\u' HEX HEX HEX HEX                 | A Unicode code point in the range U+0 to U+FFFF inclusive, corresponding to the encoded hexadecimal value.
'\U' HEX HEX HEX HEX HEX HEX HEX HEX | A Unicode code point in the range U+0 to U+10FFFF inclusive, corresponding to the encoded hexadecimal value.


Detecting ASCII characters is simple, since (in the little-endian byte pairs shown here) their second byte is 0:

L"a" = 97, 0

which will not be escaped. For characters above U+00FF the second byte is never 0:

L"愛" = 27, 97

which is escaped as \u611b. But how do I detect that a string is UTF-32, since it must be escaped differently from UTF-16, with 8 hex digits?

It is not as simple as just checking the size of the string, because a single visible character may span multiple code units, e.g.:

L"प्रे" = 42, 9, 77, 9, 48, 9, 71, 9

I'm tasked to escape unescaped input string literals like Eat, drink, 愛 and store them to disk in their escaped literal form Eat, drink, \u611b (a UTF-16 example). If my program finds a character beyond U+FFFF it should escape that too, in the form \U8902611b (a UTF-32 example), but I can't find a certain way of knowing whether I'm dealing with UTF-16 or UTF-32 in an input byte array. So, just how can I reliably distinguish UTF-32 from UTF-16 characters within a wchar_t string or byte array?
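In the absence of in-band information such as a BOM, the only portable hint is the width of wchar_t itself, and that only reveals what the current implementation uses, not what a foreign byte array contains. A minimal sketch of that compile-time check (guess_wchar_encoding is a name I made up):

```cpp
#include <string>

// Guess which encoding this platform's wchar_t strings use, based
// purely on the type's width: 2 bytes suggests UTF-16 (or UCS-2, as
// on Windows), 4 bytes suggests UTF-32 (as on most Unix-likes).
// Note this cannot classify a byte array of unknown origin.
inline std::string guess_wchar_encoding() {
    return sizeof(wchar_t) == 2 ? "UTF-16" : "UTF-32";
}
```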

Stockbreeder answered 24/5, 2014 at 10:10 Comment(9)
I don't understand what you're asking. In what form are you storing the input Eat, drink, 愛? Your question seems as if you're storing it as a byte array with unspecified encoding, and you're attempting to guess what encoding it might be, not for the whole array, but for each individual character.Lida
@hvd I'm tasked to escape unescaped input string literals like Eat, drink, 愛 and store them to disk in their escaped literal form Eat, drink, \u611b (UTF-16 example) If my program finds a UTF-32 character it should escape those too in the form like \U8902611b (UTF-32 example), but I can't find a certain way of knowing if I'm dealing with UTF-16 or UTF-32 in an input byte array.Stockbreeder
Note that Windows wchar_t is 2 bytes, Linux wchar_t is 4 bytes. Windows wchar_t will take UTF-16 but not UTF-32. Linux wchar_t will take both. If you just have a stream of bytes, there is no way of differentiating unless there is a Byte Order Mark (BOM) character at the start of the sequence.Vibrator
Literals that contain any character beyond the basic source character set (which is a subset of ASCII) are implementation-dependent, any method of dealing with them is non-portable. Whether your literals are UTF-16, UTF-32, UTF-8 or anything else is determined solely by your implementation and not by the content of the string.Masefield
You can either require the requisite knowledge of the input format (e.g. implied by platform, via a BOM, or by other means) in order to produce a single transformed output, or you can produce both possible outputs. In the absence of encoding knowledge there is no good way to distinguish UTF-16 bytes from UTF-32 bytes other than statistically. There is a statistics-based detector in the Windows API, but it has made some quite infamous mis-identifications.Kaunas
If the question is: "I need another way to encode text" then the answer surely should be: "there enough ways already, you do not want yet another way to make it difficult to read text".Unsaddle
@Vibrator From the standard: "The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'." That means any implementation that uses UTF-16 for wchar_t is non-conforming, but implementations are allowed to use UCS-2 for wchar_t. Are you sure the implementations you're referring to don't actually use UCS-2?Lida
@hvd The Microsoft compiler is non-conforming, it's a well-known fact.Masefield
@IngeHenriksen I have added some info to the answer.Masefield
There are many questions within your question, I will try to answer the most important ones.

Q. I have a C++ string like "Eat, drink, 愛", is it a UTF-8, UTF-16 or UTF-32 string?
A. This is implementation-defined. In many implementations this will be a UTF-8 string, but this is not mandated by the standard. Consult your documentation.

Q. I have a wide C++ string like L"Eat, drink, 愛", is it a UTF-8, UTF-16 or UTF-32 string?
A. This is implementation-defined. In many implementations this will be a UTF-32 string. In some other implementations it will be a UTF-16 string. Neither is mandated by the standard. Consult your documentation.

Q. How can I have portable UTF-8, UTF-16 or UTF-32 C++ string literals?
A. In C++11 there is a way:

u8"I'm a UTF-8 string."
u"I'm a UTF-16 string."
U"I'm a UTF-32 string."

In C++03, no such luck.

Q. Does the string "Eat, drink, 愛" contain at least one UTF-32 character?
A. There are no such things as UTF-32 (and UTF-16 and UTF-8) characters. There are UTF-32 etc. strings. They all contain Unicode characters.

Q. What the heck is a Unicode character?
A. It is an element of a coded character set defined by the Unicode standard. In a C++ program it can be represented in various ways, the most simple and straightforward one is with a single 32-bit integral value corresponding to the character's code point. (I'm ignoring composite characters here and equating "character" and "code point", unless stated otherwise, for simplicity).

Q. Given a Unicode character, how can I escape it?
A. Examine its value. If it's between 256 and 65535 inclusive, print a '\u' escape sequence (4 hex digits). If it's greater than 65535, print a '\U' escape sequence (8 hex digits, per the table in the question). Otherwise, print it as you normally would.
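A minimal sketch of that rule (escape_code_point is an assumed helper name, not something from the question):

```cpp
#include <cstdio>
#include <string>

// Escape a single code point: values below 256 pass through
// unchanged, values up to U+FFFF become \uXXXX, and anything
// higher becomes \UXXXXXXXX.
std::string escape_code_point(char32_t cp) {
    char buf[16];
    if (cp < 256)
        return std::string(1, static_cast<char>(cp));
    if (cp <= 0xFFFF)
        std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(cp));
    else
        std::snprintf(buf, sizeof buf, "\\U%08x", static_cast<unsigned>(cp));
    return buf;
}
```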

Q. Given a UTF-32 encoded string, how can I decompose it to characters?
A. Each element of the string (which is called a code unit) corresponds to a single character (code point). Just take them one by one. Nothing special needs to be done.
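In code that walk is trivial (a sketch; the function name is my own):

```cpp
#include <string>
#include <vector>

// In UTF-32, every code unit is itself a code point, so decomposing
// a u32string is just copying its elements one by one.
std::vector<char32_t> code_points_of_utf32(const std::u32string& s) {
    return std::vector<char32_t>(s.begin(), s.end());
}
```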

Q. Given a UTF-16 encoded string, how can I decompose it to characters?
A. Values (code units) outside of the 0xD800 to 0xDFFF range correspond to the Unicode characters with the same value. For each such value, print either a normal character or a '\u' escape sequence (4 hex digits). Values in the 0xD800 to 0xDFFF range are grouped in pairs (surrogate pairs), each pair representing a single character (code point) in the U+10000 to U+10FFFF range. For such a pair, print a '\U' escape sequence (8 hex digits). To convert a pair (v1, v2) to its character value, use this formula:

c = ((v1 - 0xd800) << 10) + (v2 - 0xdc00) + 0x10000

Note the first element of the pair must be in the range 0xd800..0xdbff and the second in 0xdc00..0xdfff; otherwise the pair is ill-formed.
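That conversion, with the left shift and the 0x10000 offset written out (decode_surrogate_pair is an assumed name; validation of the surrogate ranges is omitted here):

```cpp
#include <cstdint>

// Combine a UTF-16 surrogate pair (v1 in 0xD800..0xDBFF, v2 in
// 0xDC00..0xDFFF) into one code point: the lead surrogate carries
// the high ten bits (shifted left), the trail surrogate the low
// ten, and the 0x10000 offset restores the supplementary-plane base.
char32_t decode_surrogate_pair(char16_t v1, char16_t v2) {
    return (static_cast<char32_t>(v1 - 0xD800) << 10)
         + static_cast<char32_t>(v2 - 0xDC00) + 0x10000;
}
```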

Q. Given a UTF-8 encoded string, how can I decompose it to characters?
A. The UTF-8 encoding is a bit more complicated than the UTF-16 one and I will not detail it here. There are many descriptions and sample implementations out there on the 'net, look them up.

Q. What's up with my L"प्रे" string?
A. It is a composite character made up of four Unicode code points: U+092A, U+094D, U+0930, U+0947. Note that this is not the same as a high code point being represented by a surrogate pair, as detailed in the UTF-16 part of this answer; it is a case of "character" not being the same as "code point". Escape each code point separately. At this level of abstraction you are dealing with code points, not actual characters anyway. Characters come into play when you e.g. display them to the user or compute their position in printed text, but not when dealing with string encodings.

Kathline answered 24/5, 2014 at 10:10 Comment(6)
Is there any easy way to convert a string containing "\u60A8\u597D\u4E16\u754C" to unicode string and print it out?Underline
@Underline I don't know what the heck is "unicode string". Does my answer somehow imply that there's such thing?Masefield
If I wrote: "a sequence of Unicode characters" would that be more technically correct? Is there any easy way to convert a sequence of Unicode character escape sequences "\u60A8\u597D\u4E16\u754C" to a sequence of unicode characters (that are not encoded as character escape sequences)? Is that an OK question? Thank you!Underline
A Unicode character is a mathematical abstraction, and so is a sequence of such. To output something to a physical file/device, you want an encoded string that represents that abstraction. You need to know what encoding you want in order to start. Your string is already a UTF-16 encoded string. I suggest you ask a new question because answering in comments doesn't work.Masefield
...unless you mean "\\u..." (backslash followed by the u character), in which case you have some kind of an escape sequence that you need to parse.Masefield
@Underline You can convert an ascii string containing unicode escape sequences to utf8 and print it to a console like this: python3 -c "print('\u60A8\u597D\u4E16\u754C')" Shows something like this for me: 您好世界 If you want to do it in another language like C or C++ you'll need to find or write the appropriate parser.Prognostication
