How does the C compiler parse the following C statement?

Asked 30/8, 2014 at 12:47 Answered 30/8, 2014 at 17:41

Solved c compiler-construction printf lexical-analysis

Consider the following lines:

int i;
printf("%d",i);

Will the lexical analyzer go into the string to parse % and d as separate tokens, or will it parse "%d" as one token?

Alienation answered 30/8, 2014 at 12:47 Comment(7)

The lexical analyzer will identify "%d" as a complete string. It is the later phases of the compiler, "syntax" and "semantic" analysis, that detect the missing argument and check types. If you compile the code without i in the printf call, you will get the warning "format ‘%d’ expects a matching ‘int’ argument" due to those "syntax" and "semantic" checks. – Laudianism 30/8, 2014 at 12:59

How can syntax analysis find an error if the string is not separately parsed into tokens? – Alienation 30/8, 2014 at 12:59

"syntax analysis" is next phase after "lexical". Compiler fist generates stream of tokens (as given in sepp2k's answer) then stream of tokes are further parsed (using grammar) and semantically checked in next phase. – Laudianism 30/8, 2014 at 13:1

Try this code on your PC. You will get a warning for the print on the second line but not for str = "%d". That means the compiler only parses "%d" when it is in printf, which indicates that the work is done after the lexical phase. – Laudianism 30/8, 2014 at 13:6

try this trick – Laudianism 30/8, 2014 at 13:14

Lexical analysers don't parse anything. They scan it. Parsers parse. String literals are scanned without regard to the contents. The compiler doesn't know anything about printf() except for its presence, signature, and linkage. – Reviewer 31/8, 2014 at 1:14

I think you might be confusing terms -- when you say "one token" do you mean like how "\n" is parsed as a single character (even though you wrote two?) -- that is not the case for "%d" -- it gets parsed as two individual characters, since "%" is not an escape character that the compiler knows about / honors. – Superstitious 31/8, 2014 at 9:57

There are two parsers at work here: first, the C compiler, that will parse the C file and basically ignore the content of the string (though modern C compilers will parse the string as well to help catch bad format strings — mismatches between the % conversion specifier and the corresponding argument passed to printf() to be converted).

The next parser is the string format parser built into the C runtime library. This will be called at runtime to parse the format string when you call printf. This parser is of course very simple in comparison.

I have not checked, but I would guess that the C compilers that help checking for bad format strings will implement a printf-like parser as a post-processing step (i.e. using its own lexer).
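To make the idea of a separate format-string parser concrete, here is a minimal sketch in C of the kind of scan such a checker (or the runtime library) would have to do. The function name count_conversions is hypothetical, and the code is deliberately simplified: it handles "%%" and skips flags, width, precision and length modifiers, but ignores '*' widths and positional arguments. It is not how any particular compiler or C library implements this.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts the conversion specifications in a
 * printf-style format string that consume an argument. Simplified:
 * '*' widths and positional arguments are not handled. */
size_t count_conversions(const char *fmt)
{
    size_t count = 0;

    for (const char *p = fmt; *p != '\0'; ++p) {
        if (*p != '%')
            continue;              /* ordinary character */
        ++p;                       /* look past the '%' */
        if (*p == '%')
            continue;              /* "%%" prints a literal percent sign */
        /* Skip flags, width, precision and length modifiers. */
        while (*p != '\0' && strchr("-+ #0123456789.hlLjzt", *p) != NULL)
            ++p;
        if (*p == '\0')
            break;                 /* incomplete specification at the end */
        ++count;                   /* *p is the conversion character */
    }
    return count;
}
```

A checker could compare count_conversions("%d") against the number of arguments actually passed after the format string and warn when they differ. The point is that this scan works on the contents of an already-formed string literal, long after the lexer has produced its single token.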

Bordy answered 30/8, 2014 at 13:7 Comment(0)

A string literal is a single token. The above code will be tokenized like this:

int     keyword "int"
i       identifier
;       semicolon
printf  identifier
(       open paren
"%d"    string literal
,       comma
i       identifier
)       closing paren
;       semicolon
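
The token list above can be reproduced with a toy scanner. This is only a sketch under simplifying assumptions (no comments, numbers, keywords or multi-character punctuators), and the names tok_kind, put and next_token are made up for illustration; a real C lexer is considerably more involved. It shows that once the scanner sees an opening quote, it consumes everything up to the closing quote as one token without looking at '%' or 'd'.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds for a toy scanner; not a real compiler's. */
enum tok_kind { TOK_END, TOK_WORD, TOK_STRING, TOK_PUNCT };

/* Appends c to buf if there is room, always leaving space for '\0'. */
static void put(char *buf, size_t bufsize, size_t *n, char c)
{
    if (*n + 1 < bufsize)
        buf[(*n)++] = c;
}

/* Scans one token starting at *src, copies its text into buf and
 * advances *src past it. */
enum tok_kind next_token(const char **src, char *buf, size_t bufsize)
{
    const char *p = *src;
    size_t n = 0;
    enum tok_kind kind;

    while (isspace((unsigned char)*p))
        ++p;                                   /* skip white space */

    if (*p == '\0') {
        kind = TOK_END;
    } else if (*p == '"') {
        kind = TOK_STRING;
        do {                                   /* copy up to the closing quote */
            if (*p == '\\' && p[1] != '\0') {  /* keep escape pairs together */
                put(buf, bufsize, &n, *p);
                ++p;
            }
            put(buf, bufsize, &n, *p);
            ++p;
        } while (*p != '\0' && *p != '"');
        if (*p == '"') {                       /* copy the closing quote */
            put(buf, bufsize, &n, *p);
            ++p;
        }
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        kind = TOK_WORD;                       /* identifier or keyword */
        while (isalnum((unsigned char)*p) || *p == '_') {
            put(buf, bufsize, &n, *p);
            ++p;
        }
    } else {
        kind = TOK_PUNCT;                      /* single-character punctuator */
        put(buf, bufsize, &n, *p);
        ++p;
    }

    if (bufsize > 0)
        buf[n] = '\0';
    *src = p;
    return kind;
}
```

Calling next_token repeatedly on the text int i; printf("%d",i); yields ten tokens followed by TOK_END, and the sixth one is the whole string literal "%d", quotes included.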

Bratton answered 30/8, 2014 at 12:52 Comment(1)

I think the OP wanted to know more along the lines of will "%d" generate a char array with a length of 1, or 2 (i.e. "\n" would generate a char array with a length of 1) -- I think the answer he's looking for is that it would be a string literal with two distinct characters in it (that is then further parsed at run-time by the *printf method). – Superstitious 31/8, 2014 at 9:54

"%d" is a string literal and it will be seen as one token by both the C preprocessor and also by the compiler, we can see this by going to draft C99 standard section 6.4 Lexical elements which defines the following tokens:

token:
  keyword
  identifier
  constant
  string-literal
  punctuator

and the following preprocessing tokens:

preprocessing-token:
  header-name
  identifier
  pp-number
  character-constant
  string-literal
  punctuator
  each non-white-space character that cannot be one of the above

and says:

A token is the minimal lexical element of the language in translation phases 7 and 8. The categories of tokens are: keywords, identifiers, constants, string literals, and punctuators. A preprocessing token is the minimal lexical element of the language in translation phases 3 through 6. The categories of preprocessing tokens are: header names, identifiers, preprocessing numbers, character constants, string literals, punctuators, and single non-white-space characters that do not lexically match the other preprocessing token categories. [...]

The different phases of translation are covered in section 5.1.1.2 Translation phases and I will highlight some of the relevant ones here:

[...]

3 The source file is decomposed into preprocessing tokens and sequences of white-space characters (including comments).

[...]

6 Adjacent string literal tokens are concatenated.

7 White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token. The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.

[...]

The distinction between preprocessing tokens and tokens may seem irrelevant, but it matters in at least one case: for adjacent string literals such as "%d" "\n", you would have two preprocessing tokens, while after phase 6 there would be only one token.
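
A small C sketch can make this visible through sizeof, which reports the size of the array a string literal produces, including the terminating null character. The three function names are hypothetical wrappers added only for illustration. It also shows that "%d" holds two ordinary characters, while the escape sequence "\n" holds a single one.

```c
#include <stddef.h>

/* "%d" is the array { '%', 'd', '\0' }: '%' is not an escape character. */
size_t size_of_format(void)
{
    return sizeof("%d");
}

/* "\n" is the array { '\n', '\0' }: the escape sequence is one character. */
size_t size_of_newline(void)
{
    return sizeof("\n");
}

/* Adjacent literals are concatenated in translation phase 6, so the
 * compiler proper sees one literal: { '%', 'd', '\n', '\0' }. */
size_t size_of_concatenated(void)
{
    return sizeof("%d" "\n");
}
```

Under these definitions the functions return 3, 2 and 4 respectively, so the concatenated literal has a single terminating null character rather than two.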

Stanhope answered 30/8, 2014 at 17:41 Comment(0)
