Stringize operator failure

The C and C++ standards all include text to the effect that if a stringize operation fails to produce a valid string literal token, the behavior is undefined. In C++11 this is actually possible, by including a newline character in a raw string literal. But the catch-all has always been there in the standards.

Is there any other way that stringize can produce UB, where UB or an ill-formed program hasn't already happened?

I'd be interested to hear about any dialect of C or C++ whatsoever. I'm writing a preprocessor.

Diploma answered 2/7, 2013 at 1:55 Comment(3)
Most people struggle to get them to work, not to fail. – Incongruous
Hah… I'd like to verify that failure works, i.e. get a testcase. The trick with newlines doesn't help because I trap it and add a \n. (Well, it's a "\\\\n" if you're counting backslashes.) – Diploma
Ok then I did not get what you are asking, you need preprocessor testcases. mcpp has some validation suite. – Rafaelrafaela

The stringify (#) operator escapes \ (and ") only inside string literals and character constants. Outside of those, \ has no particular significance, except at the end of a line, where it splices lines. A lone \ is, therefore, a preprocessing token in its own right (C section 6.4, C++ section 2.5).
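As a sketch of the well-defined side of this rule, the following C fragment shows what # produces for a string literal, a character constant, and plain tokens. The function names are made up for illustration, and a conforming preprocessor is assumed.

```c
#include <assert.h>
#include <string.h>

#define Q(X) #X

/* Inside a string literal, # escapes the quotes and the backslash. */
const char *quoted_literal(void) { return Q("a\n"); }

/* Inside a character constant, backslashes are escaped as well. */
const char *char_constant(void) { return Q('\\'); }

/* Outside literals nothing is escaped; runs of whitespace become one space. */
const char *plain_tokens(void) { return Q(a   +   b); }

/* Hypothetical self-check comparing each result with a hand-written literal. */
void check_stringize(void)
{
    assert(strcmp(quoted_literal(), "\"a\\n\"") == 0);
    assert(strcmp(char_constant(), "'\\\\'") == 0);
    assert(strcmp(plain_tokens(), "a + b") == 0);
}
```

In each of these cases the result is a valid string literal, so none of them is a failure case. The interesting cases are the ones where a backslash appears outside any literal.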

Consequently, if we have

#define Q(X) #X

then

Q(\)

is a legitimate macro invocation: the \ is a preprocessing token that is never converted to a token, so it's valid. But you can't stringify \; that would give you "\", in which the backslash escapes the closing quote, so the result is not a valid string literal. Hence, the behaviour of the above is undefined.

Here's a more amusing test case:

#define Q(A) #A
#define ESCAPE(c) Q(\c)
const char* new_line=ESCAPE(n);
const char* undefined_behaviour=ESCAPE(x);
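The first half of this test case can be checked directly. The sketch below, in C, assumes a preprocessor that inserts no space between \ and the substituted argument (GCC and Clang are believed to behave this way). The function names are hypothetical. ESCAPE(x) is deliberately left out, because "\x" with no hex digits is typically rejected by the compiler.

```c
#include <assert.h>
#include <string.h>

#define Q(A) #A
#define ESCAPE(c) Q(\c)

/* The tokens \ and n are stringized to "\n", an ordinary newline escape. */
const char *escaped_newline(void) { return ESCAPE(n); }

/* The same trick with t yields "\t". */
const char *escaped_tab(void) { return ESCAPE(t); }

/* Hypothetical self-check: each result is a one-character string. */
void check_escape(void)
{
    assert(strlen(escaped_newline()) == 1);
    assert(escaped_newline()[0] == '\n');
    assert(strlen(escaped_tab()) == 1);
    assert(escaped_tab()[0] == '\t');
}
```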

A less interesting case of an undefined stringify is where the stringified parameter would be too long to be a string literal. (C++ recommends that implementations support string literals of at least 65536 characters, and C99 requires support for at least 4095, but neither says anything about the maximum size of a macro argument, which could presumably be larger.)

Buhl answered 2/7, 2013 at 6:8 Comment(5)
Thanks! Shoulda thought of that. An unterminated string is already something I've tested in raw string catenation, and this gets trapped the exact same way :) . Your more amusing case doesn't appear to be UB in the preprocessor; it's exactly the same as writing "\x" or am I missing something? (Escape sequences are translated later.) – Diploma
@Potatoswatter: A string literal contains s-chars, escape-sequences and universal-character-names. \x is none of the above. So "\x" is not a valid string literal, the way I see it, and thus how the preprocessor deals with ESCAPE(x) (or, for that matter, ESCAPE(*)) is undefined. So the preprocessor could, if it chose, replace both of them with a smiley. – Buhl
At least in C++, "Escape sequences in which the character following the backslash is not listed in Table 7 are conditionally-supported, with implementation-defined semantics." So for a discrete preprocessor, I think trapping that would be a bit restrictive. But you are right, that is the grammar :) Thanks again! – Diploma
@Potatoswatter: I agree, but I don't think that applies to \x. In any event, for UB there are no criteria at all, so whatever you decide is cool. You could even allow ESCAPE(x) "7F", although "\x" "7F" is probably an error. (And even then, I don't think it's an error which needs to be reported.) By the way, how do you use stringify to generate a raw string literal, never mind put a newline into one? – Buhl
Catenation can generate an unterminated raw string literal. A raw string literal containing a newline character is a valid input to stringizing. Those are different cases. But the same encapsulated tokenizer is used to implement both operators, so it's a common code path for me. As for that last example, "\x" "7F" is not allowed to catenate to "\x7F" because character translation and literal catenation occur in phases 5 and 6, respectively. – Diploma
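The ordering of phases 5 and 6 mentioned in the last comment can be illustrated with an escape that is valid on its own. This is a small sketch in C; the array and function names are only illustrative. The escape \x4 is converted before the adjacent literals are joined, so the following digit never becomes part of it.

```c
#include <assert.h>

/* Phase 5 turns \x4 into the single character with value 4;
   phase 6 then appends the character '1'. */
const char split_literal[] = "\x4" "1";

/* Written as one literal, the escape consumes both hex digits. */
const char joined_literal[] = "\x41";

/* Hypothetical self-check on the resulting array contents. */
void check_phases(void)
{
    assert(sizeof split_literal == 3);   /* 4, '1', terminating null */
    assert(split_literal[0] == 4);
    assert(split_literal[1] == '1');
    assert(sizeof joined_literal == 2);  /* 0x41, terminating null */
    assert(joined_literal[0] == 0x41);
}
```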
