Best way to portably assign the result of fgetc() to a char in C
Perhaps I'm overthinking this, as it seems like it should be a lot easier. I want to take a value of type int, such as is returned by fgetc(), and record it in a char buffer if it is not an end-of-file code. E.g.:

char buf;
int c = fgetc(stdin);

if (c < 0) {
    /* handle end-of-file */
} else {
    buf = (char) c;  /* not quite right */
}

However, if the platform has signed default chars then the value returned by fgetc() may be outside the range of char, in which case casting or assigning it to (signed) char produces implementation-defined behavior (right?). Surely, though, there is tons of code out there that does exactly the equivalent of the example. Is it all relying on implementation-defined behavior and/or assuming 7-bit data?

It looks to me like if I want to be certain that the behavior of my code is defined by C to be what I want, then I need to do something like this:

buf = (char) ((c > CHAR_MAX) ? (c - (UCHAR_MAX + 1)) : c);

I think that produces defined, correct behavior whether default chars are signed or unsigned, and regardless even of the size of char. Is that right? And is it really needful to do that to ensure portability?

Polygraph answered 8/10, 2013 at 14:29 Comment(0)
fgetc() returns the character as an unsigned char converted to int, or EOF. EOF is always < 0. Whether the system's char is signed or unsigned makes no difference.

C11dr 7.21.7.1 2

If the end-of-file indicator for the input stream pointed to by stream is not set and a next character is present, the fgetc function obtains that character as an unsigned char converted to an int and advances the associated file position indicator for the stream (if defined).

The concern I have about the following is that it looks to be 2's-complement dependent and implies that the ranges of unsigned char and char are equally wide. Both of these assumptions are certainly nearly always true today.

buf = (char) ((c > CHAR_MAX) ? (c - (UCHAR_MAX + 1)) : c);

[Edit per OP comment]
Let's assume fgetc() returns no more distinct characters than fit in the range CHAR_MIN to CHAR_MAX. Then (c - (UCHAR_MAX + 1)) would be more portable if replaced with (c - CHAR_MAX + CHAR_MIN): we do not know that (c - (UCHAR_MAX + 1)) is in range when c is CHAR_MAX + 1.

A system could exist that has a signed char range of -127 to +127 and an unsigned char range of 0 to 255 (5.2.4.2.1), but since fgetc() gets a character, it appears either to deal entirely in unsigned char or to have already limited itself to the smaller signed char range before converting to unsigned char and returning that value to the user. OTOH, if fgetc() returned 256 distinct characters, conversion to the narrower signed char range would not be portable regardless of formula.

Special answered 8/10, 2013 at 14:38 Comment(9)
You are absolutely correct about the nature of fgetc()'s return value, but that wasn't my question. The crux is that converting an integer value to a signed integer type produces implementation-defined behavior if the value is outside the range of the target type (or please correct me if I'm wrong). If my platform uses 8-bit signed chars, and the character I read is a British pound sign encoded in ISO-8859-1, then fgetc() will return an int value (positive) 0xA3, which is outside the range of char. Isn't that a problem?Polygraph
In practice, it is not a problem. Sorting out where the standard says it isn't may take a bit of effort. Every implementation defines the conversion from 0xA3 to signed char as a bitwise copy of the lowest 8 bits of the int value (and an implementation which did not would probably not be used).Oberg
@chux: upon reflection I think the discussion of character codes vs. signed default chars is a false trail. As in many places they do, and as C itself does, the docs you quoted conflate "character" and "byte". The question might better be couched in terms of how multiple fgetc() calls can portably be substituted for one fread() call when the target buffer is of type default char (as may be needed as input to some other function that will treat it as binary data).Polygraph
@Jonathan Leffler: thanks, I do realize that in practice most, perhaps all platforms of interest will perform the conversion in the way you describe. I am being anal about portability. Inasmuch as you and chux seem to agree with my assessment that the behavior of the first example code is implementation-defined, do you agree that my alternative solves that problem, even if perhaps it's overkill?Polygraph
For characters returned from fgetc() that are > CHAR_MAX, mapping them to the negative range CHAR_MIN to -1 depends on the integer representation used. Your solution implies 2's complement, but would have trouble with 1's complement, sign-magnitude, etc. Of course those are rarely used today. On a 2's complement machine, the simple buf = (char) c would be as portable as your proposed conversion.Special
@John Bollinger To be truly portable, stick with unsigned char, but I suspect that leads to other issues down the road. I do not see how using buf = (char) c today would introduce UB on modern machines. Your post is not anal; it gets at the core of the foundation of C.Special
@chux If you will indulge me just a bit more, can you comment on what the fgetc() specs actually mean? For example, if I make buf a default char array and do fgets(buf, 2, stdin), then I would like to compare the resulting buf[0] to what I would have gotten from fgetc(stdin). Is it correct regardless of platform to do c = (unsigned char) buf[0] to obtain the latter value from the former?Polygraph
@John Bollinger (let's set aside I/O errors and assume EOF was not encountered.) The comparison of c = fgetc(stdin) and char buf[2]; fgets(buf, 2, stdin); c = (unsigned char) buf[0] depends on signed char to unsigned char conversion which is "... the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type." C11dr 6.3.1.3. If char holds as many different values as unsigned char, I see no difference.Special
@John Bollinger If such int to char conversions are potentially problematic, I recommend using a global macro (or function). But all-in-all, I see other portability issues as much more of a concern than whether buf = (char) c always works.Special
Practically, it's simple - the obvious cast to char always works.
But you're asking about portability...

I can't see how a real portable solution could work.
This is because the guaranteed range of char is -127 to 127, which is only 255 different values. So how could you translate the 256 possible return values of fgetc (excluding EOF), to a char, without losing information?

The best I can think of is to use unsigned char and avoid char.

Trinatte answered 8/10, 2013 at 15:17 Comment(0)
With thanks to those who responded, and having now read relevant portions of the C99 standard, I have come to agree with the somewhat surprising conclusion that storing an arbitrary non-EOF value returned by fgetc() as type char without loss of fidelity is not guaranteed to be possible. In large part, that arises from the possibility that char cannot represent as many distinct values as unsigned char.

For their part, the stdio functions guarantee that if data are written to a (binary) stream and subsequently read back, then the read back data will compare equal to the original data. That turns out to have much narrower implications than I at first thought, but it does mean that fputs() must output a distinct value for each distinct char it successfully outputs, and that whatever conversion fgets() applies to store input bytes as type char must accurately reverse the conversion, if any, by which fputs() would produce the input byte as its output. As far as I can tell, however, fputs() and fgets() are permitted to fail on any input they don't like, so it is not certain that fputs() maps every possible char value to an unsigned char.

Moreover, although fputs() and fgets() operate as if by performing sequences of fputc() and fgetc() calls, respectively, it is not specified what conversions they might perform between char values in memory and the underlying unsigned char values on the stream. If a platform's fputs() uses standard integer conversion for that purpose, however, then the correct back-conversion is as I proposed:

int c = fgetc(stream);
char buf;

if (c >= 0) buf = (char) ((c > CHAR_MAX) ? (c - (UCHAR_MAX + 1)) : c);

That arises directly from the integer conversion rules, which specify that integer values are converted to unsigned types by adding or subtracting the integer multiple of <target type>_MAX + 1 needed to bring the result into the range of the target type, supported by the constraints on representation of integer types. Its correctness for that purpose does not depend on the specific representation of char values or on whether char is treated as signed or unsigned.

However, if char cannot represent as many distinct values as unsigned char, or if there are char values that fputs() refuses to output (e.g. negative ones), then there are possible values of c that could not have resulted from a char conversion in the first place. No back-conversion argument is applicable to such bytes, and there may not even be a meaningful sense of char values corresponding to them. In any case, whether the given conversion is the correct reverse-conversion for data written by fputs() seems to be implementation-defined. It is certainly implementation-defined whether buf = (char) c will have the same effect, though it does on very many systems.

Overall, I am struck by just how many details of C I/O behavior are implementation defined. That was an eye-opener for me.

Polygraph answered 9/10, 2013 at 17:15 Comment(5)
Is there anything in the C standard that would force a one's complement or sign/magnitude C implementation to choose unsigned for their default char? e.g. some requirement that a char be able to represent every possible byte (assuming 8-bit char)? If so, that would close this apparent loophole. If not, then such an actively-hostile C implementation would perhaps be legal, but not good. i.e. the implementation-defined behaviour should be chosen so that this works right for a C implementation to be useful! (I wish C would standardize more stuff, like arithmetic shifts...)Catholicism
@PeterCordes, no, the standard places no such limitation on implementations' choices. With respect to this matter, it says only "The implementation shall define char to have the same range, representation, and behavior as either signed char or unsigned char." In practice, however, although the standard does not require it, I'd consider it a quality-of-implementation issue that implementations should comply with the rule you suggest.Polygraph
I meant it might indirectly end up placing that requirement. i.e. maybe there's a rule somewhere else that can't be satisfied if char is signed one's complement.Catholicism
@PeterCordes, your question is by nature difficult to answer with certainty. I'm now a lot more familiar with the standard than I was when I originally posed the question and posted this answer, however, and I am fairly confident that there is no combination of provisions that have the effect you describe.Polygraph
That's too bad, and thanks for the reply. It would be nice if such diabolical C implementations weren't allowed, so we could stop agonizing over being truly portable when we (think we) know that no reasonable modern C implementation would be like this, for modern or potential future hardware. Learning Rust is on my to-do list; I like their idea of making 2's complement overflow allowed only if specifically requested (foo.wrapping_add(0x1234)), and overflow-detection, and built-in integer functions including popcnt.Catholicism
Best way to portably assign the result of fgetc() to a char in C

C2X is on the way

A sub-problem is saving an unsigned char value into a char, which may be signed. With 2's complement, that is not a problem.*1

On non-2's complement machines with signed char that do not support -0 *2, that is a problem. (I know of no such machines.)

In any case, with C2X, support for non-2's complement encoding is planned to be dropped, so as time goes on, we can eventually ignore non-2's complement issues and confidently use

int c = fgetc(stdin);
... 
char buf = (c > CHAR_MAX) ? (char)(c - (UCHAR_MAX + 1)) : (char)c;

UCHAR_MAX > INT_MAX??

A 2nd portability issue not discussed is when UCHAR_MAX > INT_MAX, e.g. when all integer types are 64-bit. Some graphics processors have used a common size for all integer types.

On such unicorn machines, if (c < 0) is insufficient. Could use:

int c = fgetc(stdin);

#if UCHAR_MAX <= INT_MAX
  if (c < 0) {
#else 
  if (c == EOF && (feof(stdin) || ferror(stdin))) {
#endif
...

Pedantically, ferror(stdin) could be true due to a prior input function and not this one which returned UCHAR_MAX, but let us not go into that rabbit-hole.


*1 In the case of int to signed char with c > CHAR_MAX, "Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised." applies. With 2's complement, this overwhelmingly maps [128, 255] to [-128, -1].

*2 With non-2's complement and -0 support, the common mapping keeps the low 8 bits the same. That does make for two zeros, yet proper handling of strings in <string.h> uses "For all functions in this subclause, each character shall be interpreted as if it had the type unsigned char (and therefore every possible object representation is valid and has a different value)." So -0 is not a null character, as that char is accessed as a non-zero unsigned char.

Special answered 23/8, 2022 at 13:0 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.