Responsibilities of the Lexer and the Parser

I'm currently implementing a lexer for a simple programming language. So far, I can tokenize identifiers, assignment symbols, and integer literals correctly; in general, whitespace is insignificant.

For the input foo = 42, three tokens are recognized:

  1. foo (identifier)
  2. = (symbol)
  3. 42 (integer literal)

So far, so good. However, consider the input foo = 42bar, which is invalid due to the (significant) missing space between 42 and bar. My lexer incorrectly recognizes the following tokens:

  1. foo (identifier)
  2. = (symbol)
  3. 42 (integer literal)
  4. bar (identifier)

Once the lexer sees the digit 4, it keeps reading until it encounters a non-digit. It therefore consumes the 2 and stores 42 as an integer literal token. Because whitespace is insignificant, the lexer discards any whitespace (if there is any) and starts reading the next token: it finds the identifier bar.
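The behaviour described above follows directly from "maximal munch" scanning. As a minimal sketch (in Python; the token rules are assumptions for illustration, not the actual code behind the question), a lexer like this produces exactly those four tokens:

```python
import re

# Each rule consumes as many characters as possible (maximal munch),
# which is why "42bar" falls apart into the two tokens 42 and bar.
TOKEN_RULES = [
    ("INT",   re.compile(r"\d+")),
    ("IDENT", re.compile(r"[A-Za-z_]\w*")),
    ("SYM",   re.compile(r"=")),
]

def tokenize(source):
    tokens = []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():          # whitespace is insignificant
            pos += 1
            continue
        for kind, pattern in TOKEN_RULES:
            match = pattern.match(source, pos)
            if match:
                tokens.append((kind, match.group()))
                pos = match.end()
                break
        else:
            raise SyntaxError(f"unexpected character {source[pos]!r}")
    return tokens

print(tokenize("foo = 42bar"))
# [('IDENT', 'foo'), ('SYM', '='), ('INT', '42'), ('IDENT', 'bar')]
```

Nothing in this loop ever looks at what came before the current token, so it happily accepts 42 immediately followed by bar.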

Now, here's my question: Is it still the lexer's responsibility to recognize that an identifier is not allowed at that position? Or does that check belong to the responsibilities of the parser?

Benedikta answered 13/5, 2014 at 0:47 Comment(0)

I don't think there is any consensus on the question of whether 42foo should be recognised as an invalid number or as two tokens. It's a question of style, and both usages are common in well-known languages.

For example:

$ python -c 'print 42and False'
False

$ lua -e 'print(42and false)'
lua: (command line):1: malformed number near '42a'

$ perl -le 'print 42and 0'
42

# Not an idiosyncrasy of tcc; it's defined by the standard
$ tcc -D"and=&&" -run - <<<"main(){return 42and 0;}"
stdin:1: error: invalid number

# gcc has better error messages
$ gcc -D"and=&&" -x c - <<<"main(){return 42and 0;}" && ./a.out
<stdin>: In function ‘main’:
<stdin>:1:15: error: invalid suffix "and" on integer constant
<stdin>:1:21: error: expected ‘;’ before numeric constant

$ ruby -le 'print 42and 1'
42

# And now for something completely different (explained below)
$ awk 'BEGIN{print 42foo + 3}'
423

So, both possibilities are in common use.

If you're going to reject it because you think a number and a word should be separated by whitespace, you should reject it in the lexer. The parser cannot (or should not) know whether whitespace separates two tokens. Independent of the validity of 42and, the fragments 42 + 1, 42+1, and 42+ 1 should all be parsed identically. (Except, perhaps, in Fortress. But that was an anomaly.) If you don't mind shoving numbers and words together, then let the parser reject it if (and only if) it is a syntax error.

As a side note, in C and C++, 42and is initially lexed as a "preprocessor number". After preprocessing, it needs to be relexed and it is at that point that the error message is produced. The reason for this odd behaviour is that it is completely legitimate to paste together two fragments to produce a valid number:

$ gcc -D"c_(x,y)=x##y" -D"c(x,y)=c_(x,y)"  -x c - <<<"int main(){return c(12E,1F);}"
$ ./a.out; echo $?
120

Both 12E and 1F would be invalid integers, but pasted together with the ## operator, they form a perfectly legitimate float. The ## operator only works on single tokens, so 12E and 1F both need to be lexed as single tokens. c(12E+,1F) wouldn't work, but c(12E0,1F) is also fine.

This is also why you should always put spaces around the + operator in C. A classic trick C question: "What is the value of 0x1E+2?" (The answer: it's a lexical error, because 0x1E+2 is scanned as a single preprocessor number, and that pp-number cannot be converted into a valid token.)

Finally, the explanation for the awk line:

$ awk 'BEGIN{print 42foo + 3}'
423

That's lexed by awk as BEGIN{print 42 foo + 3} which is then parsed as though it had been written BEGIN{print (42)(foo + 3);}. In awk, string concatenation is written without an operator, but it binds less tightly than any arithmetic operator. Consequently, the usual advice is to use explicit parentheses in expressions which involve concatenation, unless they are really simple. (Also, undefined variables are assumed to have the value 0 if used arithmetically and "" if used as strings.)

Schweinfurt answered 13/5, 2014 at 5:14 Comment(1)
+1, very detailed answer. I also like the real-world examples. – Benedikta

I disagree with other answers here. It should be done by the lexer. If the character following the digits isn't whitespace or a special character, you're in the middle of an illegal token, specifically an identifier that doesn't start with a letter.

Or else just return the 42 and the 'bar' separately and let the parser handle it as a syntax error.
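A minimal sketch of the lexer-side check this answer proposes (Python for illustration; the function name and token shapes are assumptions, not from the question's code): after scanning a run of digits, a letter or underscore immediately following makes the whole token invalid.

```python
import re

NUMBER = re.compile(r"\d+")

def lex_number(source, pos):
    """Scan an integer literal at `pos`; reject a letter glued onto it."""
    match = NUMBER.match(source, pos)
    end = match.end()
    # The check proposed above: a letter (or underscore) immediately after
    # the digits means we are in the middle of an illegal token.
    if end < len(source) and (source[end].isalpha() or source[end] == "_"):
        raise SyntaxError(f"invalid token at {pos}: "
                          f"number {match.group()!r} runs into a letter")
    return ("INT", match.group()), end

print(lex_number("42 + 1", 0))   # (('INT', '42'), 2)
try:
    lex_number("42bar", 0)
except SyntaxError as err:
    print(err)                   # invalid token at 0: number '42' runs into a letter
```

The cost is one character of lookahead after each number, which any practical lexer already has available.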

Bathesda answered 13/5, 2014 at 1:24 Comment(2)
This is the concern I was having, too. I'm by no means an expert in lexer/parser generation, but it didn't really feel right for the lexer to accept this. – Benedikta
I actually misread the description, despite the bold text on "invalid." You're right, it would fail before it got to the parser. – Repetition

Yes, contextual checks like this belong in the parser.

Also, you say that foo = 42bar is invalid. From the lexer's perspective, though, it is not. The 4 tokens recognized by your lexer are (probably) correct (you haven't posted your token definitions).

foo = 42bar may or may not be a valid statement in your language.

Pesthouse answered 13/5, 2014 at 1:10 Comment(3)
Alright, that sounds reasonable. My description might have been misleading: The statement is invalid in the language, but my lexer happily recognized the tokens anyway. I just wanted to make sure that it's okay for the lexer to not flag invalid tokens at that point. – Benedikta
It is. They are not invalid as tokens – it is the syntax that makes this particular combination of tokens invalid – hence the parser flagging it (when you have a traditional lexer/parser architecture). – Pesthouse
Unless 42bar is a legal token in the language, e.g. an identifier, it should fail at the lexer stage. – Bathesda

Edit: I just realized that that's actually an invalid token for your language. So yes, it would fail in the lexer at that point, because you have no rule matching it. Otherwise, what would it be, an InvalidTokenToken?

But let's say that it was a valid token. Say you write a lexer rule saying that id = <number> is ok... what do you do about id = <number> + <number> - <number>, and all of the various combinations that that leads to? How is the lexer going to give you an AST for any of those? This is where the parser comes in.

Are you using a parser combinator framework? I ask because with those the distinction between parser and lexer rules can start to seem arbitrary, especially since you may not have an explicit grammar in front of you. But the language you're parsing still has a grammar, and what counts as a parser rule is each production of the grammar. At the very "bottom", you have rules which describe a single terminal, like "a number is one or more digits", and this, and this alone, is what the lexer is used for – the reason being that it can speed up the parser and simplify its implementation.
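To illustrate that split (a hedged sketch with made-up names, not a definitive design): the lexer emits terminals as (kind, value) pairs, and a small recursive-descent parser enforces a production such as assignment := IDENT '=' INT (('+'|'-') INT)* over them.

```python
def parse_assignment(tokens):
    """Parse: assignment := IDENT '=' INT (('+'|'-') INT)* -> AST tuple."""
    pos = 0

    def expect(kind):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos][0] != kind:
            raise SyntaxError(f"expected {kind} at token {pos}")
        value = tokens[pos][1]
        pos += 1
        return value

    name = expect("IDENT")
    expect("EQUALS")
    tree = ("int", expect("INT"))
    while pos < len(tokens) and tokens[pos][0] == "OP":
        op = expect("OP")
        tree = (op, tree, ("int", expect("INT")))
    if pos != len(tokens):
        raise SyntaxError("trailing tokens")   # e.g. a stray identifier
    return ("assign", name, tree)

tokens = [("IDENT", "x"), ("EQUALS", "="), ("INT", "1"),
          ("OP", "+"), ("INT", "2"), ("OP", "-"), ("INT", "3")]
print(parse_assignment(tokens))
# ('assign', 'x', ('-', ('+', ('int', '1'), ('int', '2')), ('int', '3')))
```

Note that the lexer never had to know how many + and - terms would follow; the grammar's recursion/iteration lives entirely in the parser, which is the point being made above.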

Repetition answered 13/5, 2014 at 1:15 Comment(3)
I'm building a handwritten lexer for educational purposes, but when I implement the parser I might choose to go with a framework; I'm not sure yet, it's mostly about learning. Your answer clarified a few things for me, +1! – Benedikta
I just realized I made a mistake, and rewrote some of this. Essentially what your lexer will look like is a list of regular expressions, with an associated number for the token type. For each group of characters between whitespace, you try each regex in turn and when one matches, you return the character range tagged with the corresponding token type. If none match, it fails. – Repetition
That's not how hand-written lexers work. They switch on the next character and have little loops in each case. – Bathesda
