Yes, the UTF-8 encoding can be used with char, string, and stringstream. A char will hold a single UTF-8 code unit, of which up to four may be required to represent a single Unicode code point.
However, there are a few issues using UTF-8 specifically with Microsoft's compilers. C++ implementations use an 'execution character set' for a number of things, such as encoding character and string literals. VC++ always use the system locale encoding as the execution character set, and Windows does not support UTF-8 as the system locale encoding, therefore UTF-8 can never by the execution character set.
This means that VC++ never intentionally produces UTF-8 character and string literals. Instead the compiler must be tricked.
Edit: More recent versions of Microsoft's C++ compiler support UTF-8 source and using UTF-8 as the execution encoding. Windows also has at least a beta setting to use UTF-8 as the system locale encoding. See here.
The compiler will convert from the known source code encoding to the execution encoding. That means that if the compiler uses the locale encoding for both the source and execution encodings then no conversion is done. If you can get UTF-8 data into the source code but have the compiler think that the source uses the locale encoding, then character and string literals will use the UTF-8 encoding. VC++ uses the so-called 'BOM' to detect the source encoding, and uses the locale encoding if no BOM is detected. Therefore you can get UTF-8 encoded string literals by saving all your source files as "UTF-8 without signature".
There are caveats with this method. First, you cannot use UCNs with narrow character and string literals. Universal Character Names have to be converted to the execution character set, which isn't UTF-8. You must either write the character literally so it appears as UTF-8 in the source code, or you can use hex escapes where you manually write out a UTF-8 encoding. Second, in order to produce wide character and string literals the compiler performs a similar conversion from the source encoding to the wide execution character set (which is always UTF-16 in VC++). Since we're lying to the compiler about the encoding, it will perform this conversion to UTF-16 incorrectly. So in wide character and string literals you cannot use non-ascii characters literally, and instead you must use UCNs or hex escapes.
UTF-8 is variable length (as is UTF-16). The indices used with at()
and substr()
are code units rather than character or code point indices. So if you want a particular code unit then you can just index into the string or array or whatever as normal. If you need a particular code point then you either need a library that can understand composing UTF-8 code units into code points (such as the Boost Unicode iterators library), or you need to convert the UTF-8 data into UTF-32. If you need actual user perceived characters then you need a library that understands how code points are composed into characters. I imagine ICU has such functionality, or you could implement the Default Grapheme Cluster Boundary Specification from the Unicode standard.
The above consideration of UTF-8 only really matters for how you write Unicode data in the source code. It has little bearing on the program's input and output.
If your requirements allow you to choose how to do input and output then I would still recommend using UTF-8 for input. Depending on what you need to do with the input you can either convert it to another encoding that's easy for you to process, or you can write your processing routines to work directly on UTF-8.
If you want to ever output anything via the Windows console then you'll want a well defined module for output that can have different implementations, because internationalized output to the Windows console will require a different implementation from either outputting to a file on Windows or console and file output on other platforms. (On other platforms the console is just another file, but the Windows console needs special treatment.)
wchar_t
is a good fit for UTF-16. On Linux, it is appropriate to use for UTF-32. For UTF-8,char
is a pretty reasonable candidate (unless you've got access to the "new" character types in C++11) – Muenster