Why do C++ streams use char instead of unsigned char?

I've always wondered why the C++ Standard Library instantiates basic_[io]stream and all its variants with the char type instead of unsigned char. Because char may be signed, storing the result of operations like get() into a char can overflow it, leaving the variable with an implementation-defined value. The same issue arises when you want to write a byte, unformatted, to an ostream using its put function.
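For instance, here is a minimal sketch of the kind of pitfall I mean (assuming an 8-bit char on a platform where plain char is signed with CHAR_MAX == 127):

    #include <iostream>
    #include <sstream>

    int main() {
        std::istringstream in("\xa4");   // a single byte with value 164

        // Problematic: get() returns an int in [0, 255] or EOF; storing 164
        // into a (signed) char overflows it, giving an implementation-defined
        // value (typically -92).
        char c = in.get();

        // Safer: keep the result as int, test for EOF, then convert explicitly.
        std::istringstream in2("\xa4");
        int r = in2.get();
        if (r != std::char_traits<char>::eof()) {
            unsigned char byte = static_cast<unsigned char>(r);  // well-defined: 164
            std::cout << static_cast<int>(c) << ' ' << static_cast<int>(byte) << '\n';
        }
    }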

Any ideas?


Note: I'm still not really convinced, so if you know the definitive answer, please do post it.

Hangout answered 10/11, 2008 at 11:24 Comment(3)
I can't give a why, but I do know that the signedness of characters in GCC depends on the underlying CPU and OS. So the convention changes from one CPU/OS to another. I just can't say why it changes.Robenarobenia
Great question! Hoping somebody gives us a good reason. ACE guys use unsigned char as their ACE_Byte type ( aoc.nrao.edu/php/tjuerges/ALMA/ACE-5.5.2/html/ace/… ).Glair
..or why pick char from the 5 different 8-bit types: char, signed char, unsigned char, int8_t and uint8_t. (my vote would be for the last in this list)Relator

Possibly I've misunderstood the question, but conversion from unsigned char to char isn't unspecified, it's implementation-dependent (4.7-3 in the C++ standard).

The type of a 1-byte character in C++ is "char", not "unsigned char". This gives implementations a bit more freedom to do the best thing on the platform (for example, the standards body may have believed that there exist CPUs where signed byte arithmetic is faster than unsigned byte arithmetic, although that's speculation on my part). Also for compatibility with C. The result of removing this kind of existential uncertainty from C++ is C# ;-)

Given that the "char" type exists, I think it makes sense for the usual streams to use it even though its signedness isn't defined. So maybe your question is answered by the answer to, "why didn't C++ just define char to be unsigned?"
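As an aside, it is easy to check which choice a given implementation made; a minimal sketch:

    #include <iostream>
    #include <limits>

    int main() {
        // Each implementation documents whether plain char behaves like signed char
        // or unsigned char; this simply reports that choice and the resulting range.
        std::cout << "char is "
                  << (std::numeric_limits<char>::is_signed ? "signed" : "unsigned")
                  << ", range [" << static_cast<int>(std::numeric_limits<char>::min())
                  << ", " << static_cast<int>(std::numeric_limits<char>::max()) << "]\n";
    }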

Rousing answered 10/11, 2008 at 11:43 Comment(10)
I thought implementation-dependent is the same as unspecified. I will correct my question and look up on the difference. thanks for telling me :)Hangout
Unspecified means the implementation can put any value it likes in there (including picking one randomly each time it happens), and not document what it does. Implementation-dependent means that the implementation must document what value it puts in there.Rousing
I heard that removing the C heritage from C++ yielded D :)Sidon
ok thanks mate. anyway i meant if you were doing char c = foo.get(); doSomething(c); and don't care about EOF since you know you are not at the end.Hangout
That's then an issue with converting int_type to char. You can probably rely on the implementation to choose int_type such that this conversion is sensible, even if it technically could do something weird.Rousing
Hang on, I missed something out: when converting to a signed type, if the value is representable in the target type, then the resulting value is unchanged (also 4.7.3). The return from get is defined to be either a character value or eof, so if it's not eof then the conversion is defined.Rousing
indeed, the implementation-defined'ness is only triggered for values >CHAR_MAX. that can happen in std::istringstream a("\xa4"); char c = a.get(); , if CHAR_MAX is 127 for example. but indeed the correct way is using int. but most will just use char anyway, since they don't know about this. -.-Hangout
"where signed byte arithmetic is faster than unsigned byte arithmetic" if you do any arithmetic on ((un)?signed)? char it will be promoted to (unsigned)? int. Just doing +c is enough to trigger the promotion (this is the main use of unary +). This is a difference with C: in C any use of the value (not lvalue) of a ((un)?signed)? char will get it promoted.Toiletry
That's true, but if a, b, c are all unsigned char, and you do a = b + c, then because the result of b+c is converted back to unsigned char at the end, if you happen to have incredibly fast unsigned char arithmetic on the CPU then the compiler can use it. Same with signed char, if implementation is willing to have the same overflow behavior as the CPU. Although the values are converted to int as far as the defined C semantics are concerned, there's a significant class of cases where the effect of smaller arithmetic is the same and so the compiler can use it.Rousing
@xtofl: ... no, it yields 'peepee', sheez dude ... :)Relator

I have always understood it this way: the purpose of the iostream classes is to read and/or write a stream of characters, which, if you think about it, are abstract entities that are only represented by the computer using a character encoding. The C++ standard takes great pains to avoid pinning down the character encoding, saying only that "Objects declared as characters (char) shall be large enough to store any member of the implementation's basic character set," because it doesn't need to pin down the "implementation's basic character set" in order to define the C++ language; the standard can leave the decision of which character encoding is used to the implementation (a compiler together with an STL implementation), and just note that char objects represent single characters in some encoding.

An implementation writer could choose a single-octet encoding such as ISO-8859-1 or even a double-octet encoding such as UCS-2. It doesn't matter. As long as a char object is "large enough to store any member of the implementation's basic character set" (note that this explicitly forbids variable-length encodings), then the implementation may even choose an encoding that represents basic Latin in a way that is incompatible with any common encoding!

It is confusing that the char, signed char, and unsigned char types share "char" in their names, but it is important to keep in mind that char does not belong to the same family of fundamental types as signed char and unsigned char. signed char is in the family of signed integer types:

There are four signed integer types: "signed char", "short int", "int", and "long int."

and unsigned char is in the family of unsigned integer types:

For each of the signed integer types, there exists a corresponding (but different) unsigned integer type: "unsigned char", "unsigned short int", "unsigned int", and "unsigned long int," ...

The one similarity between the char, signed char, and unsigned char types is that "[they] occupy the same amount of storage and have the same alignment requirements". Thus, you can reinterpret_cast from char * to unsigned char * in order to determine the numeric value of a character in the execution character set.
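A short sketch of both points, assuming an 8-bit char: the three types are distinct as far as the type system is concerned, and the reinterpret_cast recovers the character's numeric value:

    #include <iostream>
    #include <type_traits>

    int main() {
        // char is a distinct type from both signed char and unsigned char,
        // even though it must behave like one of them.
        static_assert(!std::is_same<char, signed char>::value, "distinct types");
        static_assert(!std::is_same<char, unsigned char>::value, "distinct types");

        char text[] = "\xa4";   // some character in the execution character set

        // Same storage and alignment, so reinterpreting the bytes is allowed and
        // yields the character's numeric value (164 here).
        const unsigned char* bytes = reinterpret_cast<const unsigned char*>(text);
        std::cout << static_cast<unsigned int>(bytes[0]) << '\n';
    }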

To answer your question, the reason the STL uses char as the default type is that the standard streams are meant for reading and/or writing streams of characters, represented by char objects, not integer values (signed char and unsigned char). The use of char rather than a numeric type is a way of separating concerns.

Wheal answered 27/5, 2010 at 19:51 Comment(2)
char and signed char not the same? Oh wow, +1! As Scott Meyers would say, Aha! artima.com/cppsource/top_cpp_aha_moments.htmlBoyd
Both istream and fread (from C) read characters from streams, but fread uses unsigned char and istream uses char.Premonish

char is for characters, unsigned char for raw bytes of data, and signed char for, well, signed data.

The standard does not specify whether signed or unsigned char will be used for the implementation of char; it is compiler-specific. It only specifies that char must be large enough to hold the characters on your system, the way characters were in those days, which is to say, no Unicode.

Using "char" for characters is the standard way to go. Using unsigned char is a hack, although it'll match the compiler's implementation of char on most platforms.

Alexandros answered 10/11, 2008 at 11:44 Comment(1)
"Using "char" for characters is the standard way to go. Using unsigned char is a hack", how is that? Streams are not only for exchanging basic characters, but also for exchanging binary data (after all, that's what ios_base::binary is for). Would it use unsigned char, we would not have to care about negative char values at all, and always get positive values back. It would seem to be so much nicer.Hangout

I think this comment explains it well. To quote:

signed char and unsigned char are arithmetic, integral types just like int and unsigned int. On the other hand, char is expressly intended to be the "I/O" type that represents some opaque, system-specific fundamental unit of data on your platform. I would use them in this spirit.

Accomplishment answered 20/5, 2012 at 11:10 Comment(0)
