Practical difference between parser rules and lexer rules in ANTLR?

I understand the theory behind separating parser rules and lexer rules, but what are the practical differences between these two statements in ANTLR:

my_rule: ... ;

MY_RULE: ... ;

Do they result in different AST trees? Different performance? Potential ambiguities?

Client answered 28/11, 2010 at 16:30 Comment(0)

... what are the practical differences between these two statements in ANTLR ...

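MY_RULE is a lexer rule (its name starts with an uppercase letter) and will be used to tokenize your input source. It represents a fundamental building block of your language.

my_rule is a parser rule (its name starts with a lowercase letter) and is called from the parser; it consists of zero or more other parser rules or tokens produced by the lexer.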

That's the difference.

Do they result in different AST trees? Different performance? ...

The parser builds the AST using tokens produced by the lexer, so the questions make no sense (to me). A lexer merely "feeds" the parser a one-dimensional stream of tokens.
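
To make that division of labour concrete, here is a minimal Python sketch of a lexer feeding a flat token list to a single parser rule. It is not ANTLR's implementation and makes no attempt to reproduce how ANTLR chooses between competing lexer rules. The token names (ID, NUMBER, EQUALS, SEMICOLON), the assignment rule, and both function names are hypothetical, chosen only for illustration.

```python
import re

# Lexer side: each entry plays the role of an UPPERCASE lexer rule.
# All names here are hypothetical, for illustration only.
TOKEN_SPEC = [
    ("ID", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("NUMBER", r"[0-9]+"),
    ("EQUALS", r"="),
    ("SEMICOLON", r";"),
    ("WS", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Turn characters into a flat list of (token type, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "WS":  # whitespace is matched but not passed on
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def parse_assignment(tokens):
    """Parser side: plays the role of a lowercase rule such as
    assignment : ID EQUALS NUMBER SEMICOLON ;
    It only ever sees tokens, never raw characters."""
    expected = ["ID", "EQUALS", "NUMBER", "SEMICOLON"]
    kinds = [kind for kind, _lexeme in tokens]
    if kinds != expected:
        raise SyntaxError(f"expected {expected}, got {kinds}")
    lexemes = [lexeme for _kind, lexeme in tokens]
    # The tree is built from tokens; punctuation is dropped from it.
    return ("assignment", ("ID", lexemes[0]), ("NUMBER", lexemes[2]))
```

With these definitions, tokenize("x = 42;") yields a flat list of four tokens, and parse_assignment turns that list into the nested tuple ("assignment", ("ID", "x"), ("NUMBER", "42")). The tree shape is decided entirely on the parser side.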

Matejka answered 28/11, 2010 at 17:53 Comment(1)
Performance-wise, I understood the question as asking whether it is better to write my_rule : 'a'?'b'; or MY_TOKEN : 'ab' | 'b';. Both match the same input, but which is faster? Intuitively I would expect the latter to be a bit faster, but I might be wrong. - Muscadine

This post may be helpful:

The lexer is responsible for the first step, and its only job is to create a "token stream" from text. It is not responsible for understanding the semantics of your language; it is only interested in understanding the syntax of your language.

For example, syntax is the rule that an identifier must only use letters, digits and underscores, as long as it doesn't start with a digit. The responsibility of the lexer is to understand this rule. In this case, the lexer would accept the sequence of characters "asd_123" but reject the characters "12dsadsa" (assuming that there isn't another rule in which this text is valid). When seeing the valid text example, it may emit a token into the token stream such as IDENTIFIER(asd_123).
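
As a rough illustration of that identifier rule, the following Python sketch checks a candidate string against a regular expression. It assumes "letters" means ASCII letters only, which the quoted post does not specify, and the function name lex_identifier is hypothetical. In ANTLR 4 syntax, a lexer rule along the lines of IDENTIFIER : [a-zA-Z_] [a-zA-Z_0-9]* ; would express roughly the same thing.

```python
import re

# Assumption: identifiers use ASCII letters, digits and underscores,
# and must not start with a digit.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def lex_identifier(text):
    """Return the token in TYPE(lexeme) notation, or None if text is rejected."""
    if IDENTIFIER.fullmatch(text):
        return f"IDENTIFIER({text})"
    return None
```

Here lex_identifier("asd_123") returns "IDENTIFIER(asd_123)", while lex_identifier("12dsadsa") returns None because the text starts with a digit.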

Note that I said "identifier" which is the general term for things like variable names, function names, namespace names, etc. The parser would be the thing that would understand the context in which that identifier appears, so that it would then further specify that token as being a certain thing's name.

(sidenote: the token is just a unique name given to an element of the token stream. The lexeme is the text that the token was matched from. I write the lexeme in parentheses next to the token. For example, NUMBER(123). In this case, this is a NUMBER token with a lexeme of '123'. However, with some tokens, such as operators, I omit the lexeme since it's redundant. For example, I would write SEMICOLON for the semicolon token, not SEMICOLON( ; )).
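
The token/lexeme notation from the sidenote can be modelled with a small Python class. This is only a sketch: the Token class and the FIXED_TEXT_TYPES set are hypothetical names, and which token types count as having a redundant lexeme is an assumption made for the example.

```python
from dataclasses import dataclass

# Hypothetical set of token types whose lexeme is always the same text,
# so printing it would add nothing.
FIXED_TEXT_TYPES = {"SEMICOLON", "EQUALS"}


@dataclass(frozen=True)
class Token:
    type: str    # the token name, e.g. NUMBER
    lexeme: str  # the source text the token was matched from, e.g. "123"

    def __str__(self):
        # Omit the lexeme when it is redundant, as the sidenote describes.
        if self.type in FIXED_TEXT_TYPES:
            return self.type
        return f"{self.type}({self.lexeme})"
```

With this, str(Token("NUMBER", "123")) gives "NUMBER(123)" and str(Token("SEMICOLON", ";")) gives "SEMICOLON".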

From ANTLR - When to use Parser Rules vs Lexer Rules?

Fog answered 3/11, 2018 at 15:3 Comment(1)
Thank you for posting, Dave. This link was incredibly helpful! - Bittern
