Consider the following lines:
int i;
printf("%d",i);
Will the lexical analyzer go into the string to parse %
and d
as separate tokens, or will it parse "%d" as one token?
There are two parsers at work here. The first is the C compiler, which parses the C file and largely ignores the contents of the string (though modern C compilers do inspect the string as well, to help catch bad format strings, i.e. mismatches between a % conversion specifier and the corresponding argument passed to printf()).
The second parser is the format-string parser built into the C runtime library. It is invoked at runtime, when you call printf, to parse the format string. This parser is of course very simple by comparison.
I have not checked, but I would guess that C compilers that check for bad format strings implement a printf-like parser as a separate step (i.e. using their own lexer).
A string literal is a single token. The above code will be tokenized like this:
int keyword "int"
i identifier
; semicolon
printf identifier
( open paren
"%d" string literal
, comma
i identifier
) closing paren
; semicolon
"%d"
is a string literal, and it is seen as one token by both the C preprocessor and the compiler. We can see this in the draft C99 standard, section 6.4
Lexical elements, which defines the following tokens:
token:
keyword
identifier
constant
string-literal
punctuator
and the following preprocessing tokens:
preprocessing-token:
header-name
identifier
pp-number
character-constant
string-literal
punctuator
each non-white-space character that cannot be one of the above
and says:
A token is the minimal lexical element of the language in translation phases 7 and 8. The categories of tokens are: keywords, identifiers, constants, string literals, and punctuators. A preprocessing token is the minimal lexical element of the language in translation phases 3 through 6. The categories of preprocessing tokens are: header names, identifiers, preprocessing numbers, character constants, string literals, punctuators, and single non-white-space characters that do not lexically match the other preprocessing token categories.58) [...]
The different phases of translation are covered in section 5.1.1.2
Translation phases; I will highlight the relevant ones here:
[...]
3 The source file is decomposed into preprocessing tokens 6) and sequences of white-space characters (including comments).
[...]
6 Adjacent string literal tokens are concatenated.
7 White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token. The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.
[...]
The distinction between preprocessing tokens and tokens may seem irrelevant, but it matters in at least one case: with adjacent string literals such as "%d" "\n"
you have two preprocessing tokens, while after phase 6
there is only one token.
The lexer sees "%d"
as a complete string. It is the later phases of the compiler, "syntax" and "semantic" analysis, that identify the missing argument and do the type check. If you compile the code without i
in the printf call, you will get the warning "format ‘%d’ expects a matching ‘int’ argument"
thanks to those "syntax" and "semantic" checks. – Laudianism
If you write str = "%d"
, no warning is produced; the compiler only checks "%d"
when it appears in a printf call, which indicates this work is done after the lexical phase. – Laudianism
Strictly speaking, the compiler knows nothing about printf()
except for its presence, signature, and linkage. – Reviewer