Does std::wstring support UTF-16 and UTF-32 on Windows?

I'm learning about Unicode and have a few questions that I'm hoping to get answered.

1) I've read that on Linux, a std::wstring is 4 bytes, while on Windows, it's 2 bytes. Does this mean that internal support on Linux is UTF-32, while on Windows it is UTF-16?

2) Is the use of std::wstring very similar to the std::string interface?

3) Does VC++ offer support for using a 4-byte std::wstring?

4) Do you have to change compiler options if you use std::wstring?

As a sidenote, I came across a string library for working with UTF-8 which has a very similar interface to std::string which provides familiar functionality such as length, substr, find, upper/lower case conversion etc. The library is Glib::ustring.

Please feel free to add any comments or additional advice, because I really need it.

Thank you!

Nmr answered 19/9, 2014 at 16:23 Comment(3)
The C++11 char32_t type should provide a solution; it is, however, dead last on the Microsoft to-do list. Pretty doubtful that it will get any use in the next 10 years :) Yes, you almost always need a library to do anything non-trivial. ICU is a common choice. - Substitutive
@HansPassant, this was causing me quite a bit of VC++ confusion with undeclared identifiers and such. ICU is so big... I think that I'm going to try Glib::ustring to see if it satisfies all of my needs. BTW, a while back you warned me about codepages and you were so right. - Nmr
Pretty much everything about it is covered in the utf8everywhere.org manifesto. - Kelliekellina

1) wstring is a basic_string<wchar_t>, and the size of wchar_t is implementation-dependent and encoding-agnostic (the standard just says that "its values can represent distinct codes for all members of the largest extended character set specified among the supported locales"). But yes, in practice an implementation with sizeof(wchar_t) == 4 bytes can hold UTF-32, and one with sizeof(wchar_t) == 2 bytes is used for UTF-16.

2) wstring is a basic_string<wchar_t> whereas string is a basic_string<char>, so yes, it is a very similar interface. You will have to use wcout, wcin and wfstream though, and accept a few other constraints of that kind.

3) No, MSVC defines wchar_t as unsigned short, which defines and limits wstring as you said. MSVC offers the possibility of handling wchar_t as a typedef instead of a built-in type (the /Zc:wchar_t- compiler option). You could then imagine redefining the typedef, but I suspect this is extremely risky and evil.

4) No, it's up to you to choose the string type you want.

5) UTF-32 and the standard: Interestingly, in the very encoding-agnostic C++ standard, UTF-32 is mentioned explicitly only for codecvt: "the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding forms. codecvt<wchar_t, char, mbstate_t> converts between the native character sets for narrow and wide characters." This suggests that char32_t would be the portable approach to UTF-32. Unfortunately, MSVC doesn't support this type yet (as of 2014).
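
To make the UTF-32 to UTF-8 relationship in point 5 concrete, here is a minimal hand-written sketch. It is not the codecvt facet itself; the helper names append_utf8 and utf32_to_utf8 are made up for illustration, and replacing invalid values with U+FFFD is just one possible policy. It assumes a C++11 compiler with char32_t and std::u32string.

```cpp
#include <string>

// Hypothetical helper: append one Unicode code point to a UTF-8 string.
// Returns false (and appends nothing) for surrogates or values above U+10FFFF.
bool append_utf8(std::string& out, char32_t cp) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate range, not a valid scalar value
    if (cp > 0x10FFFF) return false;                 // beyond the Unicode range

    if (cp < 0x80) {                                 // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}

// Hypothetical helper: convert a whole UTF-32 string to UTF-8.
// Invalid code points are replaced by U+FFFD (an assumed policy, not a requirement).
std::string utf32_to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (!append_utf8(out, cp)) {
            append_utf8(out, 0xFFFD);
        }
    }
    return out;
}
```

For example, utf32_to_utf8(U"\U0001F600") yields the four bytes F0 9F 98 80, so one UTF-32 code unit becomes four UTF-8 code units.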

Rochelle answered 19/9, 2014 at 16:35 Comment(3)
@Christophe and @RemyLebeau, thank you for the excellent answers. - Nmr
"MSVC defines wchar_t as unsigned short" By default, wchar_t is a builtin type. The page you linked to explains that. - Ogilvie
@DDrmmr, yes, my wording is misleading. Of course, with the default options of the compiler, MSVC handles wchar_t as a native type. What I meant was that the MSVC implementation, whatever the option, gives this type the same characteristics and limits as an unsigned short. - Rochelle

1) I've read that on Linux, a std::wstring is 4 bytes, while on Windows, it's 2 bytes. Does this mean that internal support on Linux is UTF-32, while on Windows it is UTF-16?

It is actually wchar_t, not std::wstring, that is 4 bytes on Linux and 2 bytes on Windows. std::wstring is a typedef for std::basic_string<wchar_t>, so std::wstring supports UTF-32 on Linux and UTF-16 on Windows, yes.
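
A quick way to see this on your own toolchain is sketched below. The function names are hypothetical, and mapping sizeof(wchar_t) to an encoding is a widespread convention rather than something the standard promises.

```cpp
#include <cstddef>
#include <string>

// Number of wchar_t code units this platform needs for U+1F600,
// a code point outside the Basic Multilingual Plane.
std::size_t wide_units_for_emoji() {
    const std::wstring s = L"\U0001F600";
    return s.size();  // 2 with a 2-byte wchar_t (surrogate pair), 1 with a 4-byte wchar_t
}

// Conventional (not guaranteed) encoding implied by the size of wchar_t.
const char* likely_wide_encoding() {
    return sizeof(wchar_t) == 2 ? "UTF-16"
         : sizeof(wchar_t) == 4 ? "UTF-32"
                                : "unknown";
}
```

The point is that std::wstring::size() counts wchar_t code units, so on Windows it can be larger than the number of characters a user would see.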

2) Is the use of std::wstring very similar to the std::string interface?

Both std::wstring and std::string are typedefs of std::basic_string, so they have the same interface, just different value_type types (wchar_t vs char, respectively).
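
Because the interface is shared, a single function template can serve both types. A small sketch (count_char is a made-up name for illustration):

```cpp
#include <cstddef>
#include <string>

// Counts occurrences of ch in s, for any character type CharT.
// Uses only members that every std::basic_string specialization provides.
template <typename CharT>
std::size_t count_char(const std::basic_string<CharT>& s, CharT ch) {
    std::size_t n = 0;
    std::size_t pos = s.find(ch);
    while (pos != std::basic_string<CharT>::npos) {
        ++n;
        pos = s.find(ch, pos + 1);
    }
    return n;
}
```

The same template works for count_char(std::string("banana"), 'a') and count_char(std::wstring(L"banana"), L'a'); only the literal prefixes differ.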

3) Does VC++ offer support for using a 4-byte std::wstring?

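Not for std::wstring itself, no. But you can create your own std::basic_string typedef (this relies on the library supplying a usable std::char_traits for the element type, which the standard guarantees only for the built-in character types), e.g.: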

typedef std::basic_string<int32_t> u32string;

In fact, this is exactly how the new C++11 std::u16string and std::u32string types are defined:

typedef std::basic_string<char16_t> u16string;
typedef std::basic_string<char32_t> u32string;
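
As a rough sketch of how the two types relate, the hypothetical helper below re-encodes a std::u32string as a std::u16string. It assumes the input already holds valid code points (no surrogates, nothing above U+10FFFF) and does no error handling.

```cpp
#include <string>

// Hypothetical helper: UTF-32 to UTF-16, assuming valid input.
std::u16string utf32_to_utf16(const std::u32string& in) {
    std::u16string out;
    for (char32_t cp : in) {
        if (cp < 0x10000) {
            // Code points in the BMP fit in a single 16-bit code unit.
            out += static_cast<char16_t>(cp);
        } else {
            // Everything else is split into a high and a low surrogate.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}
```

With this, U"\U0001F600" (one char32_t) becomes the two char16_t units D83D DE00, which is the same sequence the compiler produces for the literal u"\U0001F600".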

It is also not unheard of to make a typedef of std::basic_string for TCHAR:

typedef std::basic_string<TCHAR> tstring;

As a sidenote, I came across a string library for working with UTF-8 which has a very similar interface to std::string which provides familiar functionality such as length, substr, find, upper/lower case conversion etc. The library is Glib::ustring.

Technically speaking, you can (and many people do) use a standard std::string for UTF-8. Glib::ustring just takes it further: it keeps the text as UTF-8 but uses gunichar (a typedef for guint32) instead of char as its character type, and exposes its interfaces to operate in terms of Unicode code points instead of encoded code units.
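
To show the code unit versus code point distinction with a plain std::string, here is a minimal sketch. The function name is made up, and it assumes the string already contains well-formed UTF-8; it does no validation.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 encoded std::string by counting
// every byte that is not a continuation byte (continuation bytes look like 10xxxxxx).
std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For the string "a\xE2\x82\xACz" (the letter a, the euro sign, the letter z), size() reports 5 bytes while utf8_code_points reports 3. A library such as Glib::ustring hides that bookkeeping behind its interface.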

Imena answered 19/9, 2014 at 17:33 Comment(0)
