TL;DR:
Why there are no unsigned wchar_t and signed wchar_t types?
Because C's wide-character handling facilities were defined such that they are not needed.
In more detail,
The signedness of char is not standardized.
To be precise, "The implementation shall define char to have the same range, representation, and behavior as either signed char or unsigned char." (C2011, 6.2.5/15)
Hence there are signed char and unsigned char types.
"Hence" implies causation, which would be hard to argue clearly, but certainly signed char and unsigned char are more appropriate when you want to handle numbers, as opposed to characters.
Therefore functions which work with single character must use the argument type which can hold both signed char and unsigned char
No, not at all. Standard library functions that work with individual characters could easily be defined in terms of type char, regardless of whether that type is signed, because the library implementation does know its signedness. If that were a problem then it would apply equally to the string functions, too -- char would be useless.
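As a rough illustration of that point: whether plain char is signed is fixed per implementation and exposed through <limits.h>, so any code, library code included, can simply ask. The helper name char_is_signed is made up for this sketch.

```c
#include <limits.h>

/* Hypothetical helper: reports whether plain char is signed on this
 * implementation. CHAR_MIN is 0 when char behaves like unsigned char
 * and negative when it behaves like signed char. */
int char_is_signed(void)
{
    return CHAR_MIN < 0;
}
```

The result differs between implementations, and sometimes between compiler options on the same one, which is exactly why it is described as implementation-defined rather than unknown.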
Your example of getchar() is non-apposite. It returns int rather than a character type because it needs to be able to return an error indicator that does not correspond to any character. Moreover, the code you present does not correspond to the accompanying warning message: it contains a conversion from int to unsigned char, but no conversion from char to unsigned char.
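The reason for the int return type can be sketched in a few lines. The standard specifies that fgetc(), and therefore getchar(), obtains each character as an unsigned char converted to int, which keeps every character value distinct from the negative EOF. The function name stdio_value_of is hypothetical; it only mimics that conversion.

```c
#include <stdio.h>

/* Hypothetical helper mimicking how fgetc()/getchar() report a byte:
 * the value is taken as unsigned char and then widened to int, so it is
 * never negative and cannot collide with EOF (assuming, as on typical
 * implementations, that int is wider than char). */
int stdio_value_of(char c)
{
    return (unsigned char)c;
}
```

With 8-bit char, stdio_value_of('\xff') is 255 whether or not plain char is signed. Storing the result of getchar() in a char before testing it would lose that distinction on common implementations.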
Some other character-handling functions accept int parameters or return values of type int both for compatibility with getchar() and other stdio functions, and for historic reasons. In days of yore, you couldn't actually pass a char at all -- it would always be promoted to int, and that is what the functions would (and must) accept. One cannot later change the argument type, evolution of the language notwithstanding.
Further, the ISO C90 standard, where wchar_t was introduced, does not say anything specific about the representation of wchar_t.
C90 isn't really relevant any longer, but no doubt it says something very similar to C2011 (7.19/2), which describes wchar_t as
an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales [...].
Your quotations from the glibc reference are non-authoritative, except possibly for glibc only. They appear in any case to be commentary, not specification, and it's unclear why you raise them. Certainly, though, at least the first is correct. Referring to the standard, if all the members of the largest extended character set specified among the locales supported by a given implementation could fit in a char then that implementation could define wchar_t as char. Such implementations used to be much more common than they are today.
You ask several questions:
Private communication reveals that an implementation is allowed to support wide characters with >=0 value only (independently of signedness of wchar_t). Anybody knows what this means?
I think it means that whoever communicated that to you doesn't know what they are talking about, or perhaps that what they are talking about is something different than the requirements placed by the C standard. You will find that in practice, character sets are defined with only non-negative character codes, but that is not a constraint placed by the C standard.
Does this mean that when wchar_t is 16-bit type (for example), we can only use 15 bits to store the value of wide character?
The C standard does not say or imply that. You can store the value of any supported character in a wchar_t. In particular, if an implementation supports a character set containing character codes exceeding 32767, then you can store those in a wchar_t.
In other words, is it true that a sign-extended wchar_t is a valid value?
The C standard does not say or imply that. It does not even say whether wchar_t is a signed type (if not, then sign extension is meaningless for it). If it is a signed type, then there is no guarantee about whether sign-extending a value representing a character in some supported character set (which value could, in principle, be negative) will produce a value that also represents a character in that character set, or in any other supported character set. The same is true of adding 1 to a wchar_t value.
Also, private communication reveals that the standard requires that any valid value of wchar_t must be representable by wint_t. Is it true?
It depends what you mean by "valid". The standard says that wint_t
is an integer type unchanged by default argument promotions that can hold any value corresponding to members of the extended character set, as well as at least one value that does not correspond to any member of the extended character set.
(C2011, 7.29.1/2)
wchar_t must be able to hold any value corresponding to a member of the extended character set, in any supported locale. wint_t must be able to hold all of those values, too. It may be, however, that wchar_t is capable of representing values that do not correspond to any character in any supported character set. Such values are valid in the sense that the type can represent them. wint_t is not required to be able to represent such values.
For example, if the largest extended character set of any supported locale uses character codes up to but not exceeding 32767, then an implementation would be free to implement wchar_t as an unsigned 16-bit integer, and wint_t as a signed 16-bit integer. The values representable by wchar_t that do not correspond to extended characters are then not representable by wint_t (but wint_t still has many candidates for its required value that does not correspond to any character).
With respect to the character and wide-character classification functions, the only answer is that the differences simply arise from different specifications. The char classification functions are defined to work with the same values that getchar() is defined to return: either EOF or a character value converted, if necessary, to unsigned char. The wide-character classification functions, on the other hand, accept arguments of type wint_t, which can represent the values of all wide characters unchanged, so there is no need for a conversion.
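A minimal sketch of that difference in calling convention follows. The function names count_lower and count_wlower are invented for illustration, and the results depend on the current locale (the "C" locale unless setlocale() has been called).

```c
#include <ctype.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Counts lowercase letters in a narrow string. Each char is converted to
 * unsigned char first, because islower() is only defined for EOF and
 * values representable as unsigned char. */
size_t count_lower(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (islower((unsigned char)*s)) n++;
    }
    return n;
}

/* Counts lowercase letters in a wide string. A wchar_t that holds a
 * member of the extended character set converts to wint_t unchanged,
 * so no cast is needed. */
size_t count_wlower(const wchar_t *s)
{
    size_t n = 0;
    for (; *s != L'\0'; s++) {
        if (iswlower(*s)) n++;
    }
    return n;
}
```

The cast in count_lower matters only where plain char is signed and the string contains characters with negative values. The wide version needs no counterpart.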
You claim in this regard that
We need to use iswlower((unsigned wchar_t)wc) here, but there is no unsigned wchar_t type.
No and maybe. You do not need to convert the wchar_t argument to iswlower() to any other type, and in particular, you do not need to convert it to an explicitly unsigned type. The wide character classification functions are not analogous to the regular character classification functions in this respect, having been designed with the benefit of hindsight. As for unsigned wchar_t, C does not require such a type to exist, so portable code should not use it, but it may exist in some implementations.
Regarding the update appended to the question:
Are the standards saying that casting to unsigned int and to int in the following two programs is guaranteed to be correct? (I just replaced wint_t and wchar_t to their actual meaning in glibc)
The standard says nothing of the sort about conforming implementations in general. I'll suppose, however, that you mean to ask specifically about conforming implementations for which wchar_t is int and wint_t is unsigned int.
On such an implementation, your first program is flawed because it does not account for the possibility that getwchar() returns WEOF. Converting WEOF to type wchar_t, if doing so does not cause a signal to be raised, is not guaranteed to produce a value that corresponds to any wide character. Passing the result of such a conversion to putwchar() therefore does not exhibit defined behavior. Moreover, if WEOF is defined with the same value as UINT_MAX (which is not representable by int) then the conversion of that value to int has implementation-defined behavior independently of the putwchar() call.
On the other hand, I think the key point you are struggling with is that if the value returned by getwchar() in the first program is not WEOF, then it is guaranteed to be one that is unchanged by conversion to wchar_t. Your first program will perform as appears to be intended in that case, but the cast to int (or wchar_t) is unnecessary.
Similarly, the second program is correct provided that the wide-character literal corresponds to a character in the applicable extended character set, but the cast is unnecessary and changes nothing. The wchar_t value of such a literal is guaranteed to be representable by type wint_t, so the cast changes the type of its operand, but not the value. (But if the literal does not correspond to a character in the extended character set then the behavior is implementation-defined.)
On the third hand, if your objective is to write strictly-conforming code then the right thing to do, and indeed the intended usage mode of these particular wide-character functions, would be this:
#include <locale.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_CTYPE, "en_US.UTF-8");
    wint_t wc = getwchar();
    if (wc != WEOF) {
        // No cast is necessary or desirable
        putwchar(wc);
    }
}
and this:
#include <locale.h>
#include <wchar.h>
#include <wctype.h>

int main(void)
{
    setlocale(LC_CTYPE, "en_US.UTF-8");
    wchar_t wc = L'ÿ';
    // No cast is necessary or desirable
    if (iswlower(wc)) return 0;
    return 1;
}