What's "wrong" with C++ wchar_t and wstrings? What are some alternatives to wide characters?
I have seen a lot of people in the C++ community (particularly ##c++ on freenode) resent the use of wstrings and wchar_t, and their use in the Windows API. What exactly is "wrong" with wchar_t and wstring, and if I want to support internationalization, what are some alternatives to wide characters?

Ehf answered 19/6, 2012 at 19:0 Comment(9)
Have any references for that? – Blades
@Dani Well, it's from my experiences with people on ##c++ on freenode. – Ehf
What is their complaint? – Warn
@CareyGregory That's exactly what I want to know, lol. – Ehf
@Ken Li: Can you quote an argument? It's impossible to reply to an argument we didn't hear. – Blades
Perhaps this awesome thread will answer all your questions? https://mcmap.net/q/16239/-std-wstring-vs-std-string – Dasilva
On Windows, you don't really have a choice. Its internal APIs were designed for UCS-2, which was reasonable at the time since it was before the variable-length UTF-8 and UTF-16 encodings were standardized. But now that they support UTF-16, they've ended up with the worst of both worlds. – Lather
utf8everywhere.org has a good discussion of reasons to avoid wide characters. – Newsmagazine
@Lather Certainly you have a choice. The nowide library provides a convenient way to convert strings just when passing them to the APIs. String-bearing API calls are usually low-frequency, so the reasonable approach is to convert ad hoc and keep files and internal variables in UTF-8 all the time. – Wrongful
What is wchar_t?

wchar_t is defined such that any locale's char encoding can be converted to a wchar_t representation where every wchar_t represents exactly one codepoint:

Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1).

                                                                               — C++ [basic.fundamental] 3.9.1/5

This does not require that wchar_t be large enough to represent any character from all locales simultaneously. That is, the encoding used for wchar_t may differ between locales, which means that you cannot necessarily convert a string to wchar_t using one locale and then convert it back to char using another locale.1
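
A minimal sketch of the hazard (the locale names and the source bytes are illustrative and platform dependent; mbstowcs/wcstombs are the standard conversion functions):

    #include <clocale>
    #include <cstdlib>

    int main() {
        // Convert multibyte text to wchar_t under one locale...
        std::setlocale(LC_ALL, "ja_JP.eucjp");    // illustrative locale name
        const char* src = "\xc6\xfc";             // hypothetical EUC-JP bytes
        wchar_t wide[16];
        std::mbstowcs(wide, src, 16);

        // ...then convert back under a different locale.
        std::setlocale(LC_ALL, "zh_CN.gb18030");  // illustrative locale name
        char back[16];
        std::wcstombs(back, wide, 16);  // not guaranteed to round-trip: the
                                        // wchar_t encoding itself may differ
                                        // between the two locales
        return 0;
    }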

Since using wchar_t as a common representation between all locales seems to be the primary use of wchar_t in practice, you might wonder what it's good for if not that.

The original intent and purpose of wchar_t was to make text processing simple by defining it such that it requires a one-to-one mapping from a string's code units to the text's characters, thus allowing the same simple algorithms used with ASCII strings to work with other languages.

Unfortunately the wording of wchar_t's specification assumes a one-to-one mapping between characters and codepoints to achieve this. Unicode breaks that assumption2, so you can't safely use wchar_t for simple text algorithms either.
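
For example (a sketch assuming a platform where each wchar_t holds one codepoint, e.g. Linux):

    #include <cwchar>

    int main() {
        // Both spell the same single user-perceived character, "é":
        const wchar_t composed[]   = L"\u00E9";   // U+00E9, precomposed
        const wchar_t decomposed[] = L"e\u0301";  // 'e' + U+0301 combining acute

        // Even with one codepoint per wchar_t, unit counts differ:
        // wcslen(composed) == 1, wcslen(decomposed) == 2.
        return std::wcslen(composed) == 1 && std::wcslen(decomposed) == 2 ? 0 : 1;
    }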

This means that portable software cannot use wchar_t either as a common representation for text between locales, or to enable the use of simple text algorithms.

What use is wchar_t today?

Not much, for portable code anyway. If __STDC_ISO_10646__ is defined then values of wchar_t directly represent Unicode codepoints with the same values in all locales. That makes it safe to do the inter-locale conversions mentioned earlier. However you can't rely on it alone to decide that you can use wchar_t this way because, while most Unix platforms define it, Windows does not, even though Windows uses the same wchar_t encoding in all locales.
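
A sketch of the check (the macro is standard; whether it is defined is up to the implementation):

    #include <cstdio>

    int main() {
    #ifdef __STDC_ISO_10646__
        // wchar_t values are Unicode codepoints, identically in every locale
        // (glibc defines this macro; its value is a yyyymmL date).
        std::printf("wchar_t is ISO 10646: %ld\n", (long)__STDC_ISO_10646__);
    #else
        // Windows lands here, despite using UTF-16 for wchar_t everywhere.
        std::printf("no cross-locale guarantee for wchar_t\n");
    #endif
        return 0;
    }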

The reason Windows doesn't define __STDC_ISO_10646__ is that Windows uses UTF-16 as its wchar_t encoding, and UTF-16 uses surrogate pairs to represent codepoints greater than U+FFFF, which means UTF-16 doesn't satisfy the requirements for __STDC_ISO_10646__.
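
An illustration of why (this sketch assumes Windows, where wchar_t is 16 bits):

    #include <cstdio>

    int main() {
        // U+1F600, a codepoint above U+FFFF, needs two UTF-16 code units:
        const wchar_t grin[] = L"\U0001F600";   // 0xD83D 0xDE00 on Windows
        std::printf("%u\n", (unsigned)(sizeof(grin) / sizeof(wchar_t) - 1));
        // Prints 2: one codepoint spanning two wchar_t values is exactly
        // what __STDC_ISO_10646__ forbids.
        return 0;
    }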

For platform-specific code wchar_t may be more useful. It's essentially required on Windows (e.g., some files simply cannot be opened without using wchar_t filenames), though Windows is the only platform where this is true as far as I know (so maybe we can think of wchar_t as 'Windows_char_t').
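
For example (a sketch; _wfopen is Microsoft-specific, and the filename is illustrative):

    #include <cstdio>

    int main() {
    #ifdef _WIN32
        // The narrow "ANSI" codepage can't spell every filename; the wide
        // API can open any file:
        FILE* f = _wfopen(L"\u65E5\u672C\u8A9E.txt", L"r");   // "日本語.txt"
    #else
        // On most Unix systems filenames are just bytes; UTF-8 works as-is:
        FILE* f = std::fopen("\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E.txt", "r");
    #endif
        if (f) std::fclose(f);
        return 0;
    }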

In hindsight wchar_t is clearly not useful for simplifying text handling, or as storage for locale independent text. Portable code should not attempt to use it for these purposes. Non-portable code may find it useful simply because some API requires it.

Alternatives

The alternative I like is to use UTF-8 encoded C strings, even on platforms not particularly friendly toward UTF-8.

This way one can write portable code using a common text representation across platforms, use standard datatypes for their intended purpose, get the language's support for those types (e.g. string literals, though some tricks are necessary to make it work for some compilers), some standard library support, debugger support (more tricks may be necessary), etc. With wide characters it's generally harder or impossible to get all of this, and you may get different pieces on different platforms.
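
One such trick is spelling the UTF-8 bytes with escape sequences, so the compiler's idea of the source encoding doesn't matter (the string here is illustrative):

    // Works even on compilers whose handling of the source charset is
    // unreliable: the char values are pinned down explicitly.
    const char* greeting = "\xC2\xA1Hola!";   // the UTF-8 bytes of "¡Hola!"
    // With a UTF-8 encoded source file, most compilers also accept:
    // const char* greeting = "¡Hola!";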

One thing UTF-8 does not provide is the ability to use simple text algorithms such as are possible with ASCII. In this UTF-8 is no worse than any other Unicode encoding. In fact it may be considered to be better because multi-code unit representations in UTF-8 are more common and so bugs in code handling such variable width representations of characters are more likely to be noticed and fixed than if you try to stick to UTF-32 with NFC or NFKC.
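
A sketch of the distinction (this counts codepoints, not user-perceived characters; the helper name is mine):

    #include <cstddef>
    #include <cstring>

    // Count UTF-8 codepoints by skipping continuation bytes (10xxxxxx).
    // Note this is still NOT a count of characters; see footnote 2.
    std::size_t utf8_codepoints(const char* s) {
        std::size_t n = 0;
        for (; *s; ++s)
            if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
                ++n;
        return n;
    }

    // std::strlen("\xC3\xA9") == 2 (bytes), utf8_codepoints("\xC3\xA9") == 1 ("é").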

Many platforms use UTF-8 as their native char encoding and many programs do not require any significant text processing, and so writing an internationalized program on those platforms is little different from writing code without considering internationalization. Writing more widely portable code, or writing on other platforms requires inserting conversions at the boundaries of APIs that use other encodings.
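
On Windows such a boundary conversion might look like this (a sketch using the Win32 MultiByteToWideChar call; error handling elided):

    #ifdef _WIN32
    #include <windows.h>
    #include <string>

    // UTF-8 -> UTF-16, done right at the Win32 boundary.
    std::wstring widen(const std::string& utf8) {
        int n = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                    (int)utf8.size(), nullptr, 0);
        std::wstring w(n, L'\0');
        if (n > 0)
            MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                (int)utf8.size(), &w[0], n);
        return w;
    }
    // Usage: CreateFileW(widen(utf8_path).c_str(), ...);
    #endif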

Another alternative used by some software is to choose a cross-platform representation, such as unsigned short arrays holding UTF-16 data, and then to supply all the library support and simply live with the costs in language support, etc.
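
A tiny illustration of that approach (ICU's UChar is the best-known example; the names here are illustrative):

    // A portable 16-bit code unit, independent of the platform's wchar_t size:
    typedef unsigned short u16unit;                   // illustrative name
    const u16unit hi[] = { 0x0048, 0x0069, 0x0000 };  // "Hi" as UTF-16

    // All string handling (length, search, collation, ...) must then come
    // from a library layer rather than the language or standard library.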

C++11 adds new kinds of wide characters as alternatives to wchar_t, char16_t and char32_t, with attendant language/library features. These aren't actually guaranteed to be UTF-16 and UTF-32, but I don't imagine any major implementation will use anything else. C++11 also improves UTF-8 support, for example with UTF-8 string literals, so it won't be necessary to trick VC++ into producing UTF-8 encoded strings (although I may continue to do so rather than use the u8 prefix).
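
The C++11 pieces, for reference (as noted above, the u"" and U"" encodings are UTF-16/UTF-32 in practice rather than by guarantee; u8"" is UTF-8 by definition):

    const char16_t* s16 = u"text";    // UTF-16 on every major implementation
    const char32_t* s32 = U"text";    // UTF-32 on every major implementation
    const char*     s8  = u8"text";   // UTF-8, no compiler tricks required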

Alternatives to avoid

TCHAR: TCHAR is for migrating ancient Windows programs that assume legacy encodings from char to wchar_t, and is best forgotten unless your program was written in some previous millennium. It's not portable and is inherently unspecific about its encoding and even its data type, making it unusable with any non-TCHAR based API. Since its purpose is migration to wchar_t, which we've seen above isn't a good idea, there is no value whatsoever in using TCHAR.
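
For the record, this is what that duality looks like (Win32 only; tchar.h is the real header):

    #include <tchar.h>   // Win32 only

    // With _UNICODE defined, TCHAR is wchar_t and _T("x") is L"x"; without
    // it, TCHAR is char and _T("x") is "x". The same source thus compiles
    // to two incompatible string types.
    const TCHAR* s = _T("text");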


1. Characters which are representable in wchar_t strings but which are not supported in any locale are not required to be represented with a single wchar_t value. This means that wchar_t could use a variable-width encoding for certain characters, another clear violation of the intent of wchar_t. Although it's arguable that a character being representable by wchar_t is enough to say that the locale 'supports' that character, in which case variable-width encodings aren't legal and Windows' use of UTF-16 is non-conformant.

2. Unicode allows many characters to be represented with multiple code points, which creates the same problems for simple text algorithms as variable width encodings. Even if one strictly maintains a composed normalization, some characters still require multiple code points. See: http://www.unicode.org/standard/where/

Cementite answered 19/6, 2012 at 19:1 Comment(20)
Addition: utf8everywhere.org recommends using UTF-8 on Windows, and Boost.Nowide is scheduled for formal review. – Marketa
The best thing, of course, is to use C# or VB.Net on Windows :) Or plain old C/Win32. But if you must use C++, then TCHAR is the best way to go, which defaults to wchar_t on MSVS2005 and higher. IMHO... – Intelligibility
@ybungalobill: agree; but if you're writing to the Win32 API, then portability likely isn't a concern anyhow - see the winapi tag in the OQ. paulsm4: TCHAR really is for code that needed to compile as both ANSI and UNICODE back in the Win95 days, and there's no good reason to use it today; for Win32 code written today, just use WCHAR. (Again, that's the Win32 way of doing things; much of the rest of the world uses char encoded as UTF-8.) – Tani
@BrendanMcK: Sure, code that uses the Win32 API on Windows and other APIs on other systems doesn't exist, right? The problem with Microsoft's approach ("use wchar internally everywhere in your app") is that it affects even code that doesn't interface with the system directly and could be portable. – Marketa
The problem is that you have to use Windows-specific functions because Microsoft's decision not to support UTF-8 as an ANSI code page "breaks" the standard C(++) library. For example, you can't fopen a file whose name contains non-ANSI characters. – Phthalein
@Phthalein Yes, you can't use the standard library on Windows, but you can create a portable interface that wraps the standard library on other platforms and converts from UTF-8 to wchar_t directly before using Win32 W functions. – Cementite
Windows does not support UTF-8 because, for backward compatibility reasons (e.g. decades-old programs still working on Windows 8), the char type is reserved for the locale. This way, old programs still work great on the locales they were coded for. Linux doesn't have this problem because no one there expects a 10-year-old program whose sources you lost to still launch and work seamlessly on your latest distribution. Different constraints, different solutions. – Prepositor
@Prepositor Of course, if MS wanted to they could support UTF-8 as a locale encoding. Programs that handle locales correctly would then work with UTF-8. – Cementite
The fact is: MS does support UTF-8 (see msdn.microsoft.com/en-us/library/windows/desktop/dd374130.aspx). Windows has to support decades-old code that has no idea about UTF-8. And the average user on Windows doesn't know about UTF-8 either. The only solution I see would be to "mark" some programs as UTF-8 based (in the same way as LARGEADDRESSAWARE), and Windows would make the translation on the fly for the programs with that mark... But I guess this is not a priority for the Windows dev team, probably because there aren't enough Windows users who care... – Prepositor
@Prepositor I said "as a locale encoding". Decades-old code that handles locales correctly would work with a new locale. Users don't need to know. (Also it's funny that your two comments contradict each other.) – Cementite
@Cementite: Decades-old code that handles locales correctly would work with a new locale. No, it wouldn't. Let's say a program I wrote 15 years ago that prints olé uses character 233 for é. When your UTF-8 Windows launches this program, it will have no hint about the language I output, so it will assume char 233 is some kind of UTF-8 typo, and print ol�. There's nothing the old program can do to say it is French, and there's nothing the new Windows can do to recognize it is French. So the output is bad, and the customer blames Microsoft. – Prepositor
@Prepositor Specifying the numeric value for the character and blindly outputting it means the program is not correctly handling encodings. The correct thing to do is to use the Windows API to query the output codepage and convert the string to that. The old program's output was broken 15 years ago: e.g., an Asian customer 15 years ago would have seen garbled output similar to what you show. – Cementite
@Cementite: No, the code is correct. My 15-year-old program worked correctly on any French Windows 15 years ago, and the user expects it to continue to work well on a current French Windows. So it is correct. Breaking that program just because a few people want support for UTF-8 makes no sense from the Windows customer's viewpoint. The user doesn't give a damn about the fact that the program was wrong according to you. And the customer won't upgrade his/her Windows if the program stops working. So Microsoft is right in ignoring a minority (even if they are right) in favor of its customers. – Prepositor
@Prepositor "It works for me" is not sufficient to demonstrate that it's correct. Furthermore, if the program doesn't work under a new locale there's an easy solution: don't use the new locale and instead just stick with the same default that you've always used to run that program. – Cementite
@Cementite: No, it works for all users, today as it worked 15 years ago. And the user doesn't need to do anything (including installing new languages, or whatever configuration half of them are unable or unwilling to do). Everything else, including the UTF-8 locale and nifty cross-platform code, doesn't matter for them. This is a reality in the Windows world. This (the binary backward compatibility) is the point. And this is because Microsoft uses the char type for legacy applications. – Prepositor
@Prepositor Like I said, that's not sufficient to demonstrate correctness. And of course it doesn't even work for, e.g., Asian Windows users. I never claimed that adding a UTF-8 locale would somehow benefit old programs that can't be compiled anymore. What I claim is that Microsoft could add a UTF-8 locale and that doing so would not impact legacy programs. In any case utf8everywhere.org works today even without locale support. Treating char as only legacy encodings is not a good idea. Even MS uses char for UTF-8. – Cementite
@Cementite: Let's wrap it up: you claim Microsoft could correct this; I claim they won't because of backward compatibility. You claim I should care whether my old char-based French (resp. Chinese) program works seamlessly on a Chinese (resp. French) Windows; I tell you I don't, and my French (resp. Chinese) customer doesn't either. You claim utf8everywhere.org is the solution; I only see gratuitous, biased claims there. Quoting Sovereign, "This exchange is over". – Prepositor
@Prepositor I didn't claim Microsoft would add a UTF-8 locale, and I don't expect them to. I didn't claim you cared; lots of programmers didn't care 10 years ago, and that's why their code has to be fixed when it turns out it needs to work in more than just one region, and that's why I've spent a fair bit of time correcting such code and consulting with others on how to avoid this mistake in the future. – Cementite
Nice exchange; one of the reasons to love this site -- shows experienced folks' misconceptions :) – Erwin
@Prepositor MS has since added UTF-8 locale support on Windows. – Clinandrium
There's nothing "wrong" with wchar_t. The problem is that, back in NT 3.x days, Microsoft decided that Unicode was Good (it is), and to implement Unicode as 16-bit, wchar_t characters. So most Microsoft literature from the mid-90's pretty much equated Unicode == utf16 == wchar_t.

Which, sadly, is not at all the case. "Wide characters" are not necessarily 2 bytes, on all platforms, under all circumstances.

This is one of the best primers on "Unicode" (independent of this question, independent of C++) I've ever seen, and I highly recommend it: Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)": https://www.joelonsoftware.com/articles/Unicode.html

And I honestly believe the best way to deal with "8-bit ASCII" vs "Win32 wide characters" vs "wchar_t-in-general" is simply to accept that "Windows is Different" ... and code accordingly.
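
A minimal sketch of "coding accordingly" (the alias name is illustrative):

    #include <string>

    #ifdef _WIN32
        using native_string = std::wstring;   // UTF-16 for Win32 "W" APIs
    #else
        using native_string = std::string;    // UTF-8 bytes everywhere else
    #endif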

IMHO...

PS:

I totally agree with jamesdlin above:

On Windows, you don't really have a choice. Its internal APIs were designed for UCS-2, which was reasonable at the time since it was before the variable-length UTF-8 and UTF-16 encodings were standardized. But now that they support UTF-16, they've ended up with the worst of both worlds.

Intelligibility answered 25/6, 2012 at 21:52 Comment(0)
