There are many questions within your question; I will try to answer the most important ones.
Q. I have a C++ string like "Eat, drink, 愛". Is it a UTF-8, UTF-16 or UTF-32 string?
A. This is implementation-defined. In many implementations this will be a UTF-8 string, but this is not mandated by the standard. Consult your documentation.
Q. I have a wide C++ string like L"Eat, drink, 愛". Is it a UTF-8, UTF-16 or UTF-32 string?
A. This is implementation-defined. In many implementations this will be a UTF-32 string. In some other implementations it will be a UTF-16 string. Neither is mandated by the standard. Consult your documentation.
Q. How can I have portable UTF-8, UTF-16 or UTF-32 C++ string literals?
A. In C++11 there is a way:
u8"I'm a UTF-8 string."
u"I'm a UTF-16 string."
U"I'm a UTF-32 string."
In C++03, no such luck.
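As a quick sanity check of the prefixes above, here is a small sketch that counts code units at compile time. It assumes a C++11 (or later) compiler. The literals are written with universal-character-names so the result does not depend on the source file's encoding.

```cpp
// U+611B (愛) takes three UTF-8 code units, one UTF-16 unit, one UTF-32 unit.
// Each array also holds one terminating null code unit.
static_assert(sizeof(u8"\u611b") == 4,
              "3 UTF-8 code units + terminator");
static_assert(sizeof(u"\u611b") / sizeof(char16_t) == 2,
              "1 UTF-16 code unit + terminator");
static_assert(sizeof(U"\u611b") / sizeof(char32_t) == 2,
              "1 UTF-32 code unit + terminator");

// U+1F600 lies outside the BMP, so UTF-16 needs a surrogate pair for it.
static_assert(sizeof(u"\U0001F600") / sizeof(char16_t) == 3,
              "2 UTF-16 code units + terminator");
static_assert(sizeof(U"\U0001F600") / sizeof(char32_t) == 2,
              "1 UTF-32 code unit + terminator");
```

If this compiles, the compiler encodes the three literal kinds as described.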
Q. Does the string "Eat, drink, 愛" contain at least one UTF-32 character?
A. There are no such things as UTF-32 (and UTF-16 and UTF-8) characters. There are UTF-32 etc. strings. They all contain Unicode characters.
Q. What the heck is a Unicode character?
A. It is an element of a coded character set defined by the Unicode standard. In a C++ program it can be represented in various ways; the simplest and most straightforward is a single 32-bit integral value equal to the character's code point. (For simplicity, I'm ignoring composite characters here and equating "character" with "code point" unless stated otherwise.)
Q. Given a Unicode character, how can I escape it?
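A. Examine its value. If it's between 256 and 65535, print a 2-byte (4 hex digits) escape sequence. If it's greater than 65535, print a 3-byte (6 hex digits) escape sequence; six digits cover every code point up to U+10FFFF. Otherwise, print it as you normally would. (If you want the output to be valid C++ universal-character-names, those are \u followed by exactly 4 hex digits and \U followed by exactly 8, so pad the long form with leading zeros.)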
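A minimal sketch of that rule follows. The name escape_code_point is made up for illustration. I'm assuming the C++ universal-character-name syntax for the output (\u plus 4 hex digits, \U plus 8), and that values below 256 are written out as a single byte.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: returns the escaped form of a single code point.
// Assumption: output uses the C++ \uXXXX / \UXXXXXXXX syntax.
std::string escape_code_point(char32_t cp)
{
    if (cp < 256) {
        // Small values are emitted unchanged, as one byte.
        return std::string(1, static_cast<char>(cp));
    }
    char buf[16];
    if (cp <= 0xFFFF) {
        std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(cp));
    } else {
        std::snprintf(buf, sizeof buf, "\\U%08x", static_cast<unsigned>(cp));
    }
    return buf;
}
```

For example, escape_code_point(0x611B) gives the six characters \u611b, and escape_code_point(U'E') gives E unchanged.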
Q. Given a UTF-32 encoded string, how can I decompose it to characters?
A. Each element of the string (which is called a code unit) corresponds to a single character (code point). Just take them one by one. Nothing special needs to be done.
Q. Given a UTF-16 encoded string, how can I decompose it to characters?
A. Values (code units) outside of the 0xD800 to 0xDFFF range correspond to the Unicode characters with the same value. For each such value, print either a normal character or a 2-byte (4 hex digits) escape sequence. Values in the 0xD800 to 0xDFFF range are grouped in pairs, each pair representing a single character (code point) in the U+10000 to U+10FFFF range. For such a pair, print a 3-byte (6 hex digits) escape sequence. To convert a pair (v1, v2) to its character value, use this formula:
c = 0x10000 + ((v1 - 0xd800) << 10) + (v2 - 0xdc00)
Note that the first element of the pair must be in the range 0xd800..0xdbff and the second one in 0xdc00..0xdfff; otherwise the pair is ill-formed.
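The UTF-16 rules above can be sketched as follows. The function name decode_utf16 and the choice to throw on ill-formed input are my own assumptions; a real program might prefer to substitute U+FFFD instead.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: turns UTF-16 code units into code points.
std::u32string decode_utf16(const std::u16string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t v1 = in[i];
        if (v1 < 0xD800 || v1 > 0xDFFF) {
            out.push_back(v1);          // not a surrogate: value is the code point
            continue;
        }
        // v1 is a surrogate: it must be a high one and must have a partner.
        if (v1 > 0xDBFF || i + 1 == in.size())
            throw std::invalid_argument("ill-formed UTF-16: unpaired surrogate");
        char32_t v2 = in[++i];
        if (v2 < 0xDC00 || v2 > 0xDFFF)
            throw std::invalid_argument("ill-formed UTF-16: bad low surrogate");
        out.push_back(0x10000 + ((v1 - 0xD800) << 10) + (v2 - 0xDC00));
    }
    return out;
}
```

The 0x10000 offset matters: the 20 bits carried by a pair count upward from U+10000, not from zero.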
Q. Given a UTF-8 encoded string, how can I decompose it to characters?
A. The UTF-8 encoding is a bit more complicated than the UTF-16 one and I will not detail it here. There are many descriptions and sample implementations out there on the 'net; look them up.
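For orientation only, here is a minimal sketch of what such a decoder can look like. The name decode_utf8 and the throw-on-error policy are assumptions of mine, and this is not a substitute for a tested library. The lead byte gives the sequence length, and each continuation byte contributes six bits.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: turns UTF-8 bytes into code points.
std::u32string decode_utf8(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else throw std::invalid_argument("ill-formed UTF-8: bad lead byte");

        if (in.size() - i < len)
            throw std::invalid_argument("ill-formed UTF-8: truncated sequence");
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("ill-formed UTF-8: bad continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        // Reject overlong forms, surrogates, and values past U+10FFFF.
        static const char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_for_len[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("ill-formed UTF-8: invalid code point");
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

For instance, the three bytes E6 84 9B decode to the single code point U+611B (愛).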
Q. What's up with my L"प्रे" string?
A. It is a composite character that is composed of four Unicode code points, U+092A, U+094D, U+0930, U+0947. Note it's not the same as a high code point being represented with a surrogate pair as detailed in the UTF-16 part of the answer. It's a case of "character" being not the same as "code point". Escape each code point separately. At this level of abstraction, you are dealing with code points, not actual characters anyway. Characters come into play when you e.g. display them for the user, or compute their position in a printed text, but not when dealing with string encodings.
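To make "escape each code point separately" concrete, here is a small sketch. The name escape_each is hypothetical, and the output format is assumed to be the C++ \uXXXX / \UXXXXXXXX syntax. It walks a UTF-32 string and escapes every element on its own, with no attempt to group code points into user-visible characters.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: escapes every code point of a UTF-32 string separately.
std::string escape_each(const std::u32string& s)
{
    std::string out;
    char buf[16];
    for (char32_t cp : s) {
        if (cp < 256) {
            out.push_back(static_cast<char>(cp));   // emitted unchanged
        } else if (cp <= 0xFFFF) {
            std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(cp));
            out += buf;
        } else {
            std::snprintf(buf, sizeof buf, "\\U%08x", static_cast<unsigned>(cp));
            out += buf;
        }
    }
    return out;
}
```

Given the string from the question, written as U"\u092A\u094D\u0930\u0947", this produces four escapes, \u092a\u094d\u0930\u0947, one per code point.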
Eat, drink, 愛? Your question seems as if you're storing it as a byte array with unspecified encoding, and you're attempting to guess what encoding it might be, not for the whole array, but for each individual character. – Lida
Eat, drink, 愛 and store them to disk in their escaped literal form Eat, drink, \u611b (UTF-16 example) If my program finds a UTF-32 character it should escape those too in the form like \U8902611b (UTF-32 example), but I can't find a certain way of knowing if I'm dealing with UTF-16 or UTF-32 in an input byte array. – Stockbreeder
char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'." That means any implementation that uses UTF-16 for wchar_t is non-conforming, but implementations are allowed to use UCS-2 for wchar_t. Are you sure the implementations you're referring to don't actually use UCS-2? – Lida