1) I've read that on Linux, a std::wstring is 4-bytes, while on Windows, it's 2-bytes. Does this mean that Linux internal support is UTF-32 while Windows it is UTF-16?
It is actually wchar_t, not std::wstring, that is 4 bytes on Linux and 2 bytes on Windows. std::wstring is a typedef for std::basic_string<wchar_t>, so std::wstring supports UTF-32 on Linux and UTF-16 on Windows, yes.
2) Is the use of std::wstring very similar to the std::string interface?
Both std::wstring and std::string are typedefs of std::basic_string, so they have the same interface, just different value_type types (wchar_t vs char, respectively).
3) Does VC++ offer support for using a 4-byte std::wstring?
Not for std::wstring itself, no. But you can create your own std::basic_string typedef, eg:

typedef std::basic_string<int32_t> u32string;

(Note that the standard library only guarantees std::char_traits specializations for actual character types, so a typedef over int32_t may need a custom traits class to be fully usable.)
In fact, this is exactly how the new C++11 std::u16string and std::u32string types are defined:

typedef std::basic_string<char16_t> u16string;
typedef std::basic_string<char32_t> u32string;
It is also not unheard of to make a typedef of std::basic_string for TCHAR:

typedef std::basic_string<TCHAR> tstring;
As a side note, I came across a string library for working with UTF-8 which has a very similar interface to std::string, providing familiar functionality such as length, substr, find, and upper/lower-case conversion. The library is Glib::ustring.
Technically speaking, you can (and many people do) use a standard std::string for UTF-8. Glib::ustring just takes it further: it still stores UTF-8 internally, but exposes its interface in terms of gunichar (a typedef for guint32) codepoints rather than char codeunits, so operations like length and substr work on whole Unicode characters instead of raw bytes.