Why are arguments which do not match the conversion specifier in printf undefined behavior?
J

5

3

In both C (n1570 7.21.6.1/10) and C++ (by inclusion of the C standard library) it is undefined behavior to provide an argument to printf whose type does not match its conversion specification. A simple example:

printf("%d", 1.9);

The format string specifies an int, while the argument is a floating point type.
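For contrast, here is a minimal sketch of the well-defined alternatives: either convert the argument so that it matches the specifier, or pick the specifier that matches the argument. The helper name format_matched is made up for illustration, and snprintf is used only so the result can be inspected as a string.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: formats x twice, each time with a conversion
   specifier that matches the type of the argument actually passed. */
int format_matched(char *buf, size_t size, double x)
{
    /* (int)x turns the argument into an int, so %d is correct
       (the value is truncated toward zero).
       x itself is a double, so %.1f is correct. */
    return snprintf(buf, size, "%d %.1f", (int)x, x);
}
```

Called with a 32-byte buffer and x = 1.9, this should leave "1 1.9" in the buffer.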

This question is inspired by the question of a user who encountered legacy code with an abundance of conversion mismatches which apparently did no harm, cf. undefined behaviour in theory and in practice.

Declaring a mere format mismatch UB seems drastic at first. It is clear that the output can be wrong, depending on things like the exact mismatch, argument types, endianness, possibly stack layout and other issues. This extends, as one commentator there pointed out, also to subsequent (or even previous?) arguments. But that is far from general UB. Personally, I never encountered anything else but the expected wrong output.

To venture a guess, I would exclude alignment issues. What I can imagine is that a format string which makes printf expect large data, combined with small actual arguments, could let printf read beyond the stack, but I lack the deeper insight into the varargs mechanism and specific printf implementation details to verify that.

I had a quick look at the printf sources, but they are pretty opaque to the casual reader.

Therefore my question: What are the specific dangers of mis-matching conversion specifiers and arguments in printf which make it UB?

Jaenicke answered 11/11, 2015 at 10:27 Comment(17)
@pmg also in C++? Anyway, I changed it to be a general floating point type.Jaenicke
Yes, 1.9 is always type double. Use 1.9f if you want a float.Recount
I'm not really getting your question: What are the specific dangers of mis-matching conversion specifiers and arguments in printf which make it UB?. The danger is that you get undefined behaviour. Maybe a certain mismatch always works in environment A, but yields UB in environment B.Northward
@MichaelWalz Joker. The committee hopefully only declares something UB if there is no good way to specify a behavior. Why is there none in this specific case? I.o.w I'm looking for the rationale of that decree.Jaenicke
I'm not sure I understand the question. C says it UB as C refuses to impose a specific behavior for these cases.Sandarac
@Sandarac As far as I understand the standard could have said something like "the resulting output of a call to printf is unspecified if a conversion specification does not match its corresponding argument". That would let the OP in the other post sleep better. Instead the committee made it overall UB, so that he must fear that his machine crashes. Why is there the danger of a crash?Jaenicke
@PeterSchneider What should printf("%d", 1.9); print? What is the "expected wrong output"?Northward
@PeterSchneider Because it is UB, so implementations can do whatever they like, for instance, to optimize the implementation of that function.Imperishable
Not sure I understand the downvotes, now on the other hand I am sure there is a duplicate of this someplace but the answer below is spot on and good.Kedge
@MichaelWalz An unspecified but finite sequence of characters would do. That is much more restrictive than general UB. Is the direction of my question really that hard to understand?Jaenicke
@Imperishable Yes, and in which specific way could that do any harm beyond wrong output?Jaenicke
@PeterSchneider: In what cases would "wrong output" be even remotely acceptable?Shien
@PeterSchneider the answer provided by Jonathan Wakely seems pretty clear.Northward
@MichaelFoukarakis I can make many errors in a program which are errors but not UB. They may be unacceptable in the application domain (because I want an accurate account balance) but still don't crash the program or, worse, corrupt unrelated information (for example, your account balance). That distinction seems essential to me. I was curious what could go wrong with a bad conversion specifier, beyond wrong output.Jaenicke
The list of things that can go wrong if you treat one piece of typed data as something it is not is probably endless.Shien
And if the standard specified that incorrect format specifiers result in unspecified (but not undefined) behaviour, should it also try to make it OK to use more format specifiers than arguments? Or it's OK to pass non-pointers where %s is used? The standard doesn't try to do any of that, it just says it's your job to use it correctly.Recount
Is it better that an incorrect program silently produces incorrect output, or that it crashes? If it crashes, it will get discarded or fixed... I don't see why the standard should attempt to prevent incorrect programs from crashing. The programmer should ensure that their format strings and argument datatypes agree, and a good compiler will help by producing warnings.Minimize
L
3

Some compilers may implement variable-format arguments in a way that allows the types of arguments to be validated; since having a program trap on incorrect usage may be better than possibly having it output seemingly-valid-but-wrong information, some platforms may choose to do that.

Because the behavior of traps is outside the realm of the C Standard, any action which might plausibly trap is classified as invoking Undefined Behavior.

Note that the possibility of implementations trapping based on incorrect formatting means that behavior is considered undefined even in cases where the expected type and the actual passed type have the same representation, except that signed and unsigned numbers of the same rank are interchangeable if the values they hold are within the range which is common to both [i.e. if a "long" holds 23, it may be output with "%lX" but not with "%X" even if "int" and "long" are the same size].

Note also that the C89 committee introduced a rule by fiat, still in force today, which states that even if "int" and "long" have the same representation, the code:

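long foo = 23;
int *u = (int *)&foo;   /* cast needed just to compile; the types are incompatible */
(*u)++;                 /* modifies a long through an int lvalue */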

invokes Undefined Behavior since it causes information which was written as type "long" to be read as type "int" (behavior would also be Undefined if it was type "unsigned int"). Since a "%X" format specifier would cause data to be read as type "unsigned int", passing the data as type "long" would almost certainly cause the data to be stored somewhere as "long" but subsequently read as type "unsigned int"; such behavior would almost certainly violate the aforementioned rule.
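A well-defined way around this, as a minimal sketch: convert the value to the type the specifier expects, rather than letting printf reinterpret the stored object. The helper names format_hex and format_hex_long are made up for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* %X expects an unsigned int, so convert the long value explicitly.
   Conversion to an unsigned type is always defined (it wraps modulo
   UINT_MAX + 1 when the value does not fit). */
int format_hex(char *buf, size_t size, long value)
{
    return snprintf(buf, size, "%X", (unsigned int)value);
}

/* %lX expects an unsigned long; this keeps the full width of the long. */
int format_hex_long(char *buf, size_t size, long value)
{
    return snprintf(buf, size, "%lX", (unsigned long)value);
}
```

For the value 23 both helpers should produce "17". The two only differ for values that do not fit in an unsigned int.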

Ludwog answered 11/11, 2015 at 20:46 Comment(0)
R
11

printf only works as described by the standard if you use it correctly. If you use it incorrectly, the behaviour is undefined. Why should the standard define what happens when you use it wrong?

Concretely, on some architectures floating point arguments are passed in different registers to integer arguments, so inside printf when it tries to find an int matching the format specifier it will find garbage in the corresponding register. Since those details are outside the scope of the standard there is no way to deal with that kind of misbehaviour except to say it's undefined.

For an example of how badly it could go wrong, using a format specifier of "%p" but passing a floating point type could mean that printf tries to read a pointer from a register or stack location which hasn't been set to a valid value and could contain a trap representation, which would cause the program to abort.
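To make the mechanism a bit more tangible, here is a minimal sketch of a variadic function that, like printf, learns the argument types only from a string. The function sum_tagged and its tag letters are invented for this illustration and are not how any real printf is written.

```c
#include <stdarg.h>

/* Hypothetical mini "formatter": each character in `types` tells the
   function how to fetch the next argument ('d' = int, 'f' = double).
   It returns the sum of all arguments it fetched. */
double sum_tagged(const char *types, ...)
{
    va_list ap;
    double total = 0.0;

    va_start(ap, types);
    for (const char *t = types; *t != '\0'; ++t) {
        if (*t == 'd') {
            /* Only valid if the caller really passed an int here. */
            total += va_arg(ap, int);
        } else if (*t == 'f') {
            /* Only valid if the caller really passed a double here. */
            total += va_arg(ap, double);
        }
        /* Unknown tags are skipped in this sketch. */
    }
    va_end(ap);
    return total;
}
```

A call such as sum_tagged("df", 2, 1.5) is fine and should return 3.5. The callee has no means of checking the tags against what was actually passed, so a call like sum_tagged("d", 1.5) makes va_arg fetch an int that was never passed, which is undefined for the same reason as the printf mismatch.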

Recount answered 11/11, 2015 at 10:32 Comment(5)
I am curious which architecture would do this, as vararg functions typically pass arguments on the stack and not in registers.Giroux
@Giroux My thinking. The general implementation will not be able to do anything else. Now there could conceivably be a 2-arg built-in override printf for specific types but I doubt that an implementation would optimize there; it's I/O, for heavens sake. Still, interesting idea.Jaenicke
The compiler is free to expand printf to an intrinsic, or inline it, because it's a standard function so its behaviour is precisely defined. If that happens there is no guarantee that it uses varargs.Recount
Even if the arguments are written to the stack for use with va_list, reading a value as the wrong type could abort, because the bit pattern of the integer 1 (for example) could be a trap representation for a pointer.Recount
Here is an example that may result in a segfault: printf("%s\n", floatValue);. The 'address' that printf() sees could be anywhere in memory, so either it accesses memory that is not available to the program, or printf() follows that 'pointer' and prints characters from the stack or heap until it encounters a NUL char. Either case is undefined behaviour.Havelock
B
3

Just to take your example: suppose that your architecture's procedure call standard says that floating-point arguments are passed in floating-point registers. But printf thinks you are passing an integer, because of the %d format specifier. So it expects an argument on the call stack, which isn't there. Now anything can happen.

Brno answered 11/11, 2015 at 11:3 Comment(1)
Well, as discussed above it's fairly far-fetched for varargs; but of course Jonathan's intrinsic/inlining concept is probably reason enough to consider it in principle.Jaenicke
Q
3

Any printf format/argument mismatch will cause erroneous output, so you cannot rely on anything once you do that. It is hard to tell which mismatches will have dire consequences beyond garbage output, because that depends completely on the specifics of the platform you are compiling for and the actual details of the printf implementation.

Passing invalid arguments to a printf instance that has a %s format can cause invalid pointers to be dereferenced. But invalid arguments for simpler types such as int or double can cause alignment errors with similar consequences.

Quirita answered 11/11, 2015 at 11:19 Comment(0)
P
2

I'll start by pointing out that long is 64-bit on 64-bit versions of OS X, Linux, the BSD clones, and various Unix flavors. 64-bit Windows, however, kept long as 32-bit.

What does this have to do with printf() and UB with respect to its conversion specifications?

Internally, printf() will use the va_arg() macro. If you use %ld on 64-bit Linux and only pass an int, the other 32 bits will be retrieved from adjacent memory. If you use %d and pass a long on 64-bit Linux, the other 32 bits will still be on the argument stack. In other words, the conversion specification indicates the type (int, long, whatever) to va_arg(), and the size of the corresponding type determines the number of bytes by which va_arg() adjusts its argument pointer. While a mismatch between int and long will just work on Windows since sizeof(int)==sizeof(long), porting the code to another 64-bit platform can cause trouble, especially when you have an int *nptr; and try to use %ld with *nptr. If you don't have access to the adjacent memory, you'll likely get a segfault. So the possible concrete cases are:

  • adjacent memory is read, and output is messed up from that point on
  • adjacent memory is attempted to be read, and there's a segfault due to a protection mechanism
  • the size of long and int are the same, so it just works
  • the value fetched is truncated, and output is messed up from that point on
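For the int *nptr case, a minimal sketch of the portable fix is to widen the value in the caller so that the argument really is a long. The helper name format_int_as_long is made up for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Formats the int that nptr points to using %ld.
   The cast widens the value before the call, so the va_arg(ap, long)
   inside snprintf fetches exactly the object that was passed,
   whether long is 32 or 64 bits wide. */
int format_int_as_long(char *buf, size_t size, const int *nptr)
{
    return snprintf(buf, size, "%ld", (long)*nptr);
}
```

With an int holding -42 this should produce "-42" on both LP64 systems and 64-bit Windows, since no bytes outside the passed argument are ever read.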

I'm not sure if alignment is an issue on some platforms, but if it is, it would depend upon the implementation of passing function parameters. Some "intelligent" compiler-specific printf() with a short argument list might bypass va_arg() altogether and represent the passed data as a string of bytes rather than working with a stack. If that happened, printf("%x %lx\n", LONG_MAX, INT_MIN); has three possibilities:

  • the size of long and int are the same, so it just works
  • ffffffff ffffffff80000000 is printed
  • the program crashes due to an alignment fault

As for why the C standard says that it causes undefined behavior: it doesn't specify exactly how va_arg() works, how function parameters are passed and represented in memory, or the explicit sizes of int, long, or other primitive data types, because it doesn't unnecessarily constrain implementations. As a result, whatever happens is something the C standard cannot predict. The examples above should be an indication of that, and I can't imagine what other implementations exist that might behave differently altogether.

Peking answered 11/11, 2015 at 14:2 Comment(1)
There is another possibility: the compiler may try to outsmart you and just fix the specifier (esp. if inlining). UB allows that too.Christianachristiane
