Escaping unicode characters with C/C++
I need to escape Unicode characters within an input string to either UTF-16 or UTF-32 escape sequences. For example, the input string literal "Eat, drink, 愛" should be escaped as "Eat, drink, \u611b". Here are the rules in a table of sorts:

Escape                               | Unicode code point
-------------------------------------|-------------------
'\u' HEX HEX HEX HEX                 | A Unicode code point in the range U+0 to U+FFFF inclusive, corresponding to the encoded hexadecimal value.
'\U' HEX HEX HEX HEX HEX HEX HEX HEX | A Unicode code point in the range U+0 to U+10FFFF inclusive, corresponding to the encoded hexadecimal value.


Detecting ASCII characters is simple, since (in the little-endian byte pairs shown here) their second byte is 0:

L"a" = 97, 0

which will not be escaped. For characters above U+00FF the second byte is never 0:

L"愛" = 27, 97

which is escaped as \u611b. But how do I detect that a string is UTF-32, since it must be escaped differently from UTF-16, with 8 hex digits?

It is not as simple as just checking the size of the string, because a single visible character may span multiple code units, e.g.:

L"प्रे" = 42, 9, 77, 9, 48, 9, 71, 9

I'm tasked to escape unescaped input string literals like Eat, drink, 愛 and store them to disk in their escaped literal form Eat, drink, \u611b (a UTF-16 example). If my program finds a character beyond U+FFFF it should escape that too, in the form \U8902611b (a UTF-32 example), but I can't find a certain way of knowing whether I'm dealing with UTF-16 or UTF-32 in an input byte array. So, just how can I reliably distinguish UTF-32 from UTF-16 characters within a wchar_t string or byte array?
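In the absence of in-band information such as a BOM, the only portable hint is the width of wchar_t itself, and that only reveals what the current implementation uses, not what a foreign byte array contains. A minimal sketch of that compile-time check (guess_wchar_encoding is a name I made up):

```cpp
#include <string>

// Guess which encoding this platform's wchar_t strings use, based
// purely on the type's width: 2 bytes suggests UTF-16 (or UCS-2, as
// on Windows), 4 bytes suggests UTF-32 (as on most Unix-likes).
// Note this cannot classify a byte array of unknown origin.
inline std::string guess_wchar_encoding() {
    return sizeof(wchar_t) == 2 ? "UTF-16" : "UTF-32";
}
```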

Stockbreeder answered 24/5, 2014 at 10:10 Comment(9)
I don't understand what you're asking. In what form are you storing the input Eat, drink, 愛? Your question seems as if you're storing it as a byte array with unspecified encoding, and you're attempting to guess what encoding it might be, not for the whole array, but for each individual character.Lida
@hvd I'm tasked to escape unescaped input string literals like Eat, drink, 愛 and store them to disk in their escaped literal form Eat, drink, \u611b (UTF-16 example) If my program finds a UTF-32 character it should escape those too in the form like \U8902611b (UTF-32 example), but I can't find a certain way of knowing if I'm dealing with UTF-16 or UTF-32 in an input byte array.Stockbreeder
Note that Windows wchar_t is 2 bytes, Linux wchar_t is 4 bytes. Windows wchar_t will take UTF-16 but not UTF-32. Linux wchar_t will take both. If you just have a stream of bytes, there is no way of differentiating unless there is a Byte Order Mark (BOM) character at the start of the sequence.Vibrator
Literals that contain any character beyond the basic source character set (which is a subset of ASCII) are implementation-dependent, any method of dealing with them is non-portable. Whether your literals are UTF-16, UTF-32, UTF-8 or anything else is determined solely by your implementation and not by the content of the string.Masefield
You can either require the requisite knowledge of the input format (e.g. implied by platform, via a BOM, or by other means) in order to produce a single transformed output, or you can produce both possible outputs. In the absence of encoding knowledge there is no good way to distinguish UTF-16 bytes from UTF-32 bytes other than statistically. There is a statistics-based detector in the Windows API, but it has made some quite infamous mis-identifications.Kaunas
If the question is: "I need another way to encode text" then the answer surely should be: "there enough ways already, you do not want yet another way to make it difficult to read text".Unsaddle
@Vibrator From the standard: "The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'." That means any implementation that uses UTF-16 for wchar_t is non-conforming, but implementations are allowed to use UCS-2 for wchar_t. Are you sure the implementations you're referring to don't actually use UCS-2?Lida
@hvd The Microsoft compiler is non-conforming, it's a well-known fact.Masefield
@IngeHenriksen I have added some info to the answer.Masefield
There are many questions within your question, I will try to answer the most important ones.

Q. I have a C++ string like "Eat, drink, 愛", is it a UTF-8, UTF-16 or UTF-32 string?
A. This is implementation-defined. In many implementations this will be a UTF-8 string, but this is not mandated by the standard. Consult your documentation.

Q. I have a wide C++ string like L"Eat, drink, 愛", is it a UTF-8, UTF-16 or UTF-32 string?
A. This is implementation-defined. In many implementations this will be a UTF-32 string. In some other implementations it will be a UTF-16 string. Neither is mandated by the standard. Consult your documentation.

Q. How can I have portable UTF-8, UTF-16 or UTF-32 C++ string literals?
A. In C++11 there is a way:

u8"I'm a UTF-8 string."
u"I'm a UTF-16 string."
U"I'm a UTF-32 string."

In C++03, no such luck.

Q. Does the string "Eat, drink, 愛" contain at least one UTF-32 character?
A. There are no such things as UTF-32 (and UTF-16 and UTF-8) characters. There are UTF-32 etc. strings. They all contain Unicode characters.

Q. What the heck is a Unicode character?
A. It is an element of a coded character set defined by the Unicode standard. In a C++ program it can be represented in various ways, the most simple and straightforward one is with a single 32-bit integral value corresponding to the character's code point. (I'm ignoring composite characters here and equating "character" and "code point", unless stated otherwise, for simplicity).

Q. Given a Unicode character, how can I escape it?
A. Examine its value. If it's between 256 and 65535 inclusive, print a '\u' escape sequence (4 hex digits). If it's greater than 65535, print a '\U' escape sequence (8 hex digits, per the table in the question). Otherwise, print it as you normally would.
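A minimal sketch of that rule (escape_code_point is an assumed helper name, not something from the question):

```cpp
#include <cstdio>
#include <string>

// Escape a single code point: values below 256 pass through
// unchanged, values up to U+FFFF become \uXXXX, and anything
// higher becomes \UXXXXXXXX.
std::string escape_code_point(char32_t cp) {
    char buf[16];
    if (cp < 256)
        return std::string(1, static_cast<char>(cp));
    if (cp <= 0xFFFF)
        std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(cp));
    else
        std::snprintf(buf, sizeof buf, "\\U%08x", static_cast<unsigned>(cp));
    return buf;
}
```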

Q. Given a UTF-32 encoded string, how can I decompose it to characters?
A. Each element of the string (which is called a code unit) corresponds to a single character (code point). Just take them one by one. Nothing special needs to be done.
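In code that walk is trivial (a sketch; the function name is my own):

```cpp
#include <string>
#include <vector>

// In UTF-32, every code unit is itself a code point, so decomposing
// a u32string is just copying its elements one by one.
std::vector<char32_t> code_points_of_utf32(const std::u32string& s) {
    return std::vector<char32_t>(s.begin(), s.end());
}
```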

Q. Given a UTF-16 encoded string, how can I decompose it to characters?
A. Values (code units) outside of the 0xD800 to 0xDFFF range correspond to the Unicode characters with the same value. For each such value, print either a normal character or a '\u' escape sequence (4 hex digits). Values in the 0xD800 to 0xDFFF range are grouped in pairs (surrogate pairs), each pair representing a single character (code point) in the U+10000 to U+10FFFF range. For such a pair, print a '\U' escape sequence (8 hex digits). To convert a pair (v1, v2) to its character value, use this formula:

c = ((v1 - 0xd800) << 10) + (v2 - 0xdc00) + 0x10000

Note the first element of the pair must be in the range 0xd800..0xdbff and the second in 0xdc00..0xdfff; otherwise the pair is ill-formed.
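That conversion, with the left shift and the 0x10000 offset written out (decode_surrogate_pair is an assumed name; validation of the surrogate ranges is omitted here):

```cpp
#include <cstdint>

// Combine a UTF-16 surrogate pair (v1 in 0xD800..0xDBFF, v2 in
// 0xDC00..0xDFFF) into one code point: the lead surrogate carries
// the high ten bits (shifted left), the trail surrogate the low
// ten, and the 0x10000 offset restores the supplementary-plane base.
char32_t decode_surrogate_pair(char16_t v1, char16_t v2) {
    return (static_cast<char32_t>(v1 - 0xD800) << 10)
         + static_cast<char32_t>(v2 - 0xDC00) + 0x10000;
}
```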

Q. Given a UTF-8 encoded string, how can I decompose it to characters?
A. The UTF-8 encoding is a bit more complicated than the UTF-16 one and I will not detail it here. There are many descriptions and sample implementations out there on the 'net, look them up.

Q. What's up with my L"प्रे" string?
A. It is a composite character made up of four Unicode code points: U+092A, U+094D, U+0930, U+0947. Note that this is not the same as a high code point being represented by a surrogate pair, as detailed in the UTF-16 part of this answer; it is a case of "character" not being the same as "code point". Escape each code point separately. At this level of abstraction you are dealing with code points, not actual characters anyway. Characters come into play when you e.g. display them to the user or compute their position in printed text, but not when dealing with string encodings.

Kathline answered 24/5, 2014 at 10:10 Comment(6)
Is there any easy way to convert a string containing "\u60A8\u597D\u4E16\u754C" to unicode string and print it out?Underline
@Underline I don't know what the heck is "unicode string". Does my answer somehow imply that there's such thing?Masefield
If I wrote: "a sequence of Unicode characters" would that be more technically correct? Is there any easy way to convert a sequence of Unicode character escape sequences "\u60A8\u597D\u4E16\u754C" to a sequence of unicode characters (that are not encoded as character escape sequences)? Is that an OK question? Thank you!Underline
A Unicode character is a mathematical abstraction, and so is a sequence of such. To output something to a physical file/device, you want an encoded string that represents that abstraction. You need to know what encoding you want in order to start. Your string is already a UTF-16 encoded string. I suggest you ask a new question because answering in comments doesn't work.Masefield
...unless you mean "\\u..." (backslash followed by the u character), in which case you have some kind of an escape sequence that you need to parse.Masefield
@Underline You can convert an ascii string containing unicode escape sequences to utf8 and print it to a console like this: python3 -c "print('\u60A8\u597D\u4E16\u754C')" Shows something like this for me: 您好世界 If you want to do it in another language like C or C++ you'll need to find or write the appropriate parser.Prognostication
