Is wchar_t needed for unicode support?

Is the wchar_t type required for Unicode support? If not, what's the point of this multibyte type? Why would you use wchar_t when you could accomplish the same thing with char?

Portentous answered 13/2, 2010 at 23:32 Comment(1)
Related: https://mcmap.net/q/146558/-why-was-wchar_t-invented Fireside

No.

Technically, no. Unicode is a standard that defines code points and it does not require a particular encoding.

So you could use Unicode with the UTF-8 encoding, and then everything would fit in one or a short sequence of char objects, and it would even still be null-terminated.

The problem with UTF-8 and UTF-16 is that s[i] is not necessarily a character any more; it might be just a piece of one, whereas with sufficiently wide characters you can preserve the abstraction that s[i] is a single character, though that does not make strings fixed-length under various transformations.
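
For instance, a minimal C++11 sketch of that pitfall (the bytes in the narrow literal are just the UTF-8 encoding of "é", written out explicitly):

    #include <iostream>
    #include <string>

    int main() {
        // UTF-8: s[i] is a code unit (one byte), not necessarily a character.
        std::string s = "\xC3\xA9";        // "é" (U+00E9) takes two bytes in UTF-8
        std::cout << s.size() << '\n';     // prints 2, though it is one character

        // UTF-16 has the same issue for characters outside the BMP:
        std::u16string t = u"\U0001F600";  // one emoji, stored as a surrogate pair
        std::cout << t.size() << '\n';     // prints 2 again
    }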

32-bit integers are at least wide enough to solve the code point problem, but they still don't handle corner cases: e.g., uppercasing can change the number of characters (the German "ß" uppercases to the two-character "SS").

So it turns out that the s[i] problem is not completely solved even by char32_t, and those other encodings make poor file formats.

Your implied point, then, is quite valid: wchar_t is a failure, partly because Windows made it only 16 bits, and partly because it didn't solve every problem and was horribly incompatible with the byte stream abstraction.

Imminent answered 13/2, 2010 at 23:59 Comment(5)
Unicode only recently (4.0?) added more than 65536 code points. A conforming C++ implementation therefore has to choose: support only Unicode 3.x and a 16-bit wchar_t, or use a 32-bit wchar_t. Using UTF-16 is technically non-conforming, as there's no such thing as a "null-terminated multi-wchar_t" encoding.Dupondius
Characters outside the BMP were first assigned in Unicode 3.1, in 2001.Darice
"The problem with UTF-8 is that ..." IMO that's not a problem at all. The problem with wchar_t, on the other hand, is that it gives the false illusion that this one-wchar_t-equals-one-UNICODE-character abstraction still hold true when it is clearly not the case. This just promotes buggy code that breaks down the moment the program has to deal with characters that violate this false assumption.Sniperscope
"The problem with UTF-8" that you named is the exact same problem with UTF-16. Your answer gives the impression that wchar, which is 16-bit in some systems, is less problematic in this regard. There is no middle ground where you can "mostly preserve" an abstraction—you either correctly handle UTF-16 surrogate pairs, or you don't, in which case your program is broken.Connective
Ok, I've updated things to note the progressive failure of the various attempts at fixed-length encodings. My earlier description was accurate in that the fixed-length encodings did work for a while but people didn't get the implied timeline.Imminent

As has already been noted, wchar_t is absolutely not necessary for Unicode support. Not only that, it is also utterly useless for that purpose, since the standard provides no fixed-size guarantee for wchar_t (in other words, you don't know ahead of time what sizeof( wchar_t ) will be on a particular system), whereas sizeof( char ) will always be 1.

In a UTF-8 encoding, any actual UNICODE character is mapped to a sequence of one or more (up to four, I believe) octets. In a UTF-16 encoding, any actual UNICODE character is mapped to a sequence of one or more (up to two, I believe) 16-bit words. In a UTF-32 encoding, any actual UNICODE character is mapped to exactly one 32-bit word.

As you can see, wchar_t could be of some use for implementing UTF-16 support IF the standard were nice enough to guarantee that wchar_t is always 16 bits wide. Unfortunately it does not, so you'd have to resort to a fixed-width integer type from <cstdint> (such as std::uint16_t) anyway.
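
A minimal sketch of that portability gap; note that C++11 also added char16_t, which (like std::uint16_t) has a guaranteed width, unlike wchar_t:

    #include <cstdio>

    int main() {
        // No width guarantee here: typically 2 bytes on Windows, 4 on Unix-likes.
        std::printf("sizeof(wchar_t)   = %zu\n", sizeof(wchar_t));

        // char16_t is guaranteed to be exactly one UTF-16 code unit wide.
        char16_t astral[] = u"\U0001F600";  // outside the BMP: needs a surrogate pair
        std::printf("UTF-16 code units = %zu\n",
                    sizeof astral / sizeof astral[0] - 1);  // prints 2
    }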

<slightly OffTopic Microsoft-specific rant>

What's more infuriating is the additional confusion caused by Microsoft's Visual Studio UNICODE and MBCS (multi-byte character set) build configurations. Both of these are

A) confusing and B) an outright lie

because a "UNICODE" configuration in Visual Studio does nothing to buy the programmer actual Unicode support, and the difference implied by these two build configurations doesn't make any sense. To explain, Microsoft recommends using TCHAR instead of using char or wchar_t directly. In an MBCS configuration, TCHAR expands to char, meaning you could potentially use this to implement UTF-8 support. In a UNICODE configuration, it expands to wchar_t, which in Visual Studio happens to be 16 bits wide and could potentially be used to implement UTF-16 support (which, as far as I'm aware, is the native encoding used by Windows).

However, both of these encodings are multi-byte character sets, since both UTF-8 and UTF-16 allow for the possibility that a particular Unicode character may be encoded as more than one char/wchar_t respectively, so the term multi-byte character set (as opposed to single-byte character set?) makes little sense.
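
A simplified sketch of the TCHAR machinery (not the verbatim <tchar.h> contents) shows how thin the difference between the two configurations really is:

    // Simplified sketch, not the verbatim <tchar.h>:
    #ifdef _UNICODE
        typedef wchar_t TCHAR;        // "UNICODE" build: 16-bit code units on Windows
        #define _T(x) L ## x
    #else
        typedef char TCHAR;           // "MBCS" build: narrow code units
        #define _T(x) x
    #endif

    // Either way, one TCHAR is one code unit, not one character:
    // both UTF-8 and UTF-16 are variable-width encodings.
    TCHAR greeting[] = _T("hello");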

To add insult to injury, merely using the Unicode configuration does not actually give you one iota of Unicode support. To actually get that, you have to use an actual Unicode library like ICU (http://site.icu-project.org/). In short, the wchar_t type and Microsoft's MBCS and UNICODE configurations add nothing of any use and cause unnecessary confusion, and the world would be a significantly better place if none of them had ever been invented.

</slightly OffTopic Microsoft-specific rant>
Sniperscope answered 26/3, 2015 at 16:12 Comment(0)

You absolutely do not need wchar_t to support Unicode in your software; in fact, using wchar_t makes it even harder, because you do not know whether a "wide string" is UTF-16 or UTF-32. That depends on the OS: under Windows it is UTF-16, on all others UTF-32.

However, UTF-8 allows you to write Unicode-enabled software easily.(*)

See: https://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful

(*) Note: under Windows you still have to use wchar_t, because Windows does not support UTF-8 locales, so for Unicode-enabled Windows programming you have to use the wchar_t-based APIs.
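
As a minimal, Windows-only sketch of what that entails in practice, UTF-8 text can be converted to the UTF-16 form the "W" APIs expect (error handling omitted for brevity):

    #include <windows.h>
    #include <string>

    // Convert UTF-8 to the UTF-16 std::wstring that Win32 "W" APIs take.
    std::wstring widen(const std::string& utf8) {
        int len = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, nullptr, 0);
        std::wstring wide(len, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &wide[0], len);
        wide.resize(len - 1);  // drop the redundant null terminator
        return wide;
    }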

Bibliographer answered 14/2, 2010 at 10:57 Comment(0)

wchar_t is absolutely NOT required for Unicode. UTF-8, for example, maintains backward compatibility with ASCII and uses plain 8-bit char. wchar_t mostly yields support for so-called multi-byte characters, or basically any character set that's encoded using more than the sizeof(char).
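
A trivial sketch illustrating that backward compatibility (the string here is arbitrary; any pure-ASCII text works):

    #include <cassert>
    #include <string>

    int main() {
        // Code points below U+0080 encode in UTF-8 as the same single byte,
        // so every pure-ASCII string is already valid UTF-8 as-is.
        std::string ascii = "hello, world";
        for (unsigned char c : ascii)
            assert(c < 0x80);  // all ASCII, hence byte-for-byte valid UTF-8
    }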

Chen answered 13/2, 2010 at 23:54 Comment(6)
It sounds like you are implying that UTF-8 encodes all characters as 8 bits, which is not only untrue, but if true would be quite a feat of data compression. UTF-8 is a multi-byte encoding: some characters are encoded using 8 bits, some using 16 bits, some using 24 bits and some using 32 bits. It can support (though it's not currently needed, I think) characters encoded using up to 48 bits.Misname
"It sounds like you are implying that UTF-8 encodes all characters as 8-bits" -- No it doesn't.Beneficiary
"wchar_t mostly yields support for so-called multi-byte characters" - You are confusing "multi-byte" with "variable width". "Variable width" is an intrinsic feature of both UTF-8 and UTF-16. No difference there. Besides, the C++ Standard does not mandate any particular encoding for wchar_t. A compiler could opt to make it UTF-32, yielding a fixed-width character encoding. This answer is fairly misleading, and entirely not useful. -1.Twentyfour
@DanMoulding It sounds like you take one thing and then decide to speak for someone else by adding to it so then you can criticise the original point. That's not only unreasonable but it's a fallacy. No. The answer doesn't imply that at all. It's YOU that wants it to be or believes it to be or whatever other possibilities. But it's not actually implied in the answer.Forestaysail
@Forestaysail "wchar_t mostly yields support for so-called multi-byte characters, or basically any character set that's encoded using more than the sizeof(char)" It says right there wchar_t is for supporting character sets that are encoded using more than sizeof(char) (8-bits). That implies that UTF-8 does not need wchar_t because it doesn't use more than 8-bits to encode characters. The purpose of my comment was to add clarity for those less familiar with this topic, to avoid confusion. I'm not sure what the purpose of your comment is, other than to criticize some imagined malice.Misname
@DanMoulding I know that. But you're still reading in more than what the answer says. If my comment read as malicious I apologise - it wasn't meant to. I am naturally one to satirise and though I try my best sometimes I don't get it clear. It's fine that you're trying to add clarity. In fact that's great. I just think it's not right or fair to imply that the post was implying something else. Anyway I'm sorry if I caused offence - it was not meant to be anything but calling out an unfortunately far too common fallacy of reading more than what's intended (something we're all susceptible to too).Forestaysail

wchar_t is not required. It's not even guaranteed to have a specific encoding. The point is to provide a data type that represents the wide characters native to your system, similar to char representing native characters. On Windows, for example, you can use wchar_t to access the wide character Win32 API functions.
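
For example, a minimal Windows-only sketch calling a wide Win32 API directly:

    #include <windows.h>

    int main() {
        // L"..." literals are wchar_t strings, matching the wide "W" API flavor.
        MessageBoxW(nullptr, L"Hello, wide world", L"wchar_t demo", MB_OK);
    }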

Plaided answered 13/2, 2010 at 23:39 Comment(0)

Be careful: wchar_t is often 16 bits, which is not enough to store all Unicode characters, and it is a bad choice for data in UTF-8, for instance.

Dusk answered 13/2, 2010 at 23:55 Comment(2)
This is not true on Linux (or, I assume other Unix-ish systems), where it's 32 bits. It depends on the compiler and runtime.Placenta
@Placenta The point of saying that wchar_t is "not enough to store all unicode characters" is that the program does not portably gain the simplicity of a fixed-width encoding with it.Gesticulatory

Because you can't accomplish the same thing with char:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Gonzalogoo answered 13/2, 2010 at 23:40 Comment(4)
As the title of that post says, this is something every developer absolutely, positively must know about unicode. For that reason alone, I wish I could give more than a single upvote. :)Wandy
The reference is good but the statement is actually quite false. Using UTF-8 to map Unicode to legacy char is not only possible but quite likely the single most common encoding.Imminent
zdawg is right. You do not need wchar_t to implement Unicode properly, and using it will not necessarily even help. For one thing, wchar_t can be as small as 8 bits. On Windows it is 16, which means you can represent a UTF-16 code unit, but /not all characters/. That's why the Unicode standard says "Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text." You can use char, as long as you treat char as meaning "byte."Grizel
-1 wchar_t is one of the worst things invented, because wchar_t may be 4 or 2 bytes.Bibliographer

char is generally a single byte. (sizeof(char) must be equal to 1).

wchar_t was added to the language specifically to support wide characters, i.e., characters that don't fit in a single char.
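
That sizeof(char) guarantee can even be checked at compile time; a minimal sketch:

    #include <climits>

    // sizeof(char) is 1 by definition; note a C++ "byte" is CHAR_BIT bits,
    // which is at least 8 but not necessarily exactly 8.
    static_assert(sizeof(char) == 1, "true by definition");

    int main() {}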

Revealment answered 13/2, 2010 at 23:36 Comment(1)
The C and C++ definitions of "byte" are "amount of memory taken by a single char". No need for weasel words like "generally" here. It might not be an octet (8 bits) though.Dupondius
