Lexer to handle lines with line number prefix

I'm writing a parser for a language that looks like the following:

L00<<identifier>>
L10<<keyword>>
L250<<identifier>>
<<identifier>>

That is, each line may or may not start with a line number of the form Lxxx.. ('L' followed by one or more digits), followed by an identifier or a keyword. Identifiers are standard [a-zA-Z_][a-zA-Z0-9_]* and the number of digits following the L is not fixed. Spaces between the line number and the following identifier/keyword are optional (and not present in most cases).

My current grammar looks like:

// Parser rules
commands      : command*;
command       : LINE_NUM? keyword NEWLINE
              | LINE_NUM? IDENTIFIER NEWLINE;
keyword       : KEYWORD_A | KEYWORD_B | ... ;

// Lexer rules
fragment INT  : [0-9]+;
LINE_NUM      : 'L' INT;
KEYWORD_A     : 'someKeyword';
KEYWORD_B     : 'reservedWord';
...
IDENTIFIER    : [a-zA-Z_][a-zA-Z0-9_]*;

However, this causes every line that begins with a line number to be tokenized as a single IDENTIFIER instead of a LINE_NUM followed by an identifier or keyword.

Is there a way to properly tokenize this input using an ANTLR grammar?

Brachial answered 1/4, 2014 at 14:35 Comment(4)
Are there spaces (or a space) between LINE_NUM and IDENTIFIER? - Greyso
@Bart Whitespace between LINE_NUM and IDENTIFIER is optional. I edited the question to clarify. - Brachial
Your sample implies (if it's valid) that an identifier may be optionally preceded by a LINE_NUM. The grammar says it's mandatory. Is that right? - Shanell
The implied grammar of the sample is the correct behavior. LINE_NUM is optional in both cases. I corrected the grammar. - Brachial

You need to add a semantic predicate to IDENTIFIER:

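// The predicate disables IDENTIFIER only at column 0 when the input starts
// with 'L' followed by a digit, so that LINE_NUM can match there instead.
IDENTIFIER
  : {getCharPositionInLine() != 0
      || _input.LA(1) != 'L'
      || !Character.isDigit(_input.LA(2))}?
    [a-zA-Z_] [a-zA-Z0-9_]*
  ;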
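
Some background on why the predicate is needed: an ANTLR lexer prefers the rule that matches the most characters, so on input such as L250foo the unguarded IDENTIFIER rule (seven characters) beats LINE_NUM (four characters). The condition the predicate tests can be tried out in isolation. The following is a minimal plain-Java sketch, not ANTLR-generated code; the class LinePrefixCheck and the method identifierAllowed are hypothetical names introduced only for illustration.

```java
class LinePrefixCheck {
    // Mirrors the predicate: IDENTIFIER may start at `column` unless we are at
    // the start of the line AND the input there is 'L' followed by a digit.
    public static boolean identifierAllowed(String line, int column) {
        boolean atLineStart = column == 0;
        boolean startsWithL = column < line.length() && line.charAt(column) == 'L';
        boolean digitFollows = column + 1 < line.length()
                && Character.isDigit(line.charAt(column + 1));
        return !(atLineStart && startsWithL && digitFollows);
    }

    public static void main(String[] args) {
        System.out.println(identifierAllowed("L250foo", 0)); // false: LINE_NUM must match here
        System.out.println(identifierAllowed("L250foo", 4)); // true: "foo" is not at column 0
        System.out.println(identifierAllowed("Label", 0));   // true: 'L' is not followed by a digit
    }
}
```

Once IDENTIFIER is ruled out at column 0, LINE_NUM is the only rule left that matches L250, and the rest of the line is tokenized as usual.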

You could also avoid semantic predicates by using lexer modes.

//
// Default mode is active at the beginning of a line
//

LINE_NUM
  : 'L' [0-9]+ -> pushMode(NotBeginningOfLine)
  ;

KEYWORD_A : 'someKeyword' -> pushMode(NotBeginningOfLine);
KEYWORD_B : 'reservedWord' -> pushMode(NotBeginningOfLine);
IDENTIFIER
  : ( 'L'
    | 'L' [a-zA-Z_] [a-zA-Z0-9_]*
    | [a-zA-KM-Z_] [a-zA-Z0-9_]*
    )
    -> pushMode(NotBeginningOfLine)
  ;
NL : ('\r' '\n'? | '\n');

mode NotBeginningOfLine;

  NotBeginningOfLine_NL : ('\r' '\n'? | '\n') -> type(NL), popMode;
  NotBeginningOfLine_KEYWORD_A : KEYWORD_A -> type(KEYWORD_A);
  NotBeginningOfLine_KEYWORD_B : KEYWORD_B -> type(KEYWORD_B);
  NotBeginningOfLine_IDENTIFIER
    : [a-zA-Z_] [a-zA-Z0-9_]* -> type(IDENTIFIER)
    ;
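
To see the effect of the two modes without generating a lexer, the behavior can be imitated in plain Java. This is only a sketch under stated assumptions: TwoModeSketch, tokenTypes and classify are hypothetical names, it handles a single line without its newline, it skips spaces (the grammar above shows no whitespace rule), and a boolean flag stands in for ANTLR's mode stack.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoModeSketch {
    private static final Pattern LINE_NUM = Pattern.compile("L[0-9]+");
    private static final Pattern WORD = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*");

    // A word that is exactly a keyword gets the keyword type, otherwise IDENTIFIER.
    public static String classify(String word) {
        switch (word) {
            case "someKeyword":  return "KEYWORD_A";
            case "reservedWord": return "KEYWORD_B";
            default:             return "IDENTIFIER";
        }
    }

    // Returns the token type names for one line (assumption: no newline included).
    public static List<String> tokenTypes(String line) {
        List<String> types = new ArrayList<>();
        int pos = 0;
        boolean beginningOfLine = true; // stands in for the default mode
        while (pos < line.length()) {
            if (line.charAt(pos) == ' ') { pos++; continue; } // assumed whitespace skipping
            if (beginningOfLine) {
                beginningOfLine = false; // any first token "pushes" NotBeginningOfLine
                Matcher num = LINE_NUM.matcher(line).region(pos, line.length());
                if (num.lookingAt()) {
                    types.add("LINE_NUM");
                    pos = num.end();
                    continue;
                }
            }
            Matcher word = WORD.matcher(line).region(pos, line.length());
            if (!word.lookingAt()) {
                throw new IllegalArgumentException("Unexpected character at column " + pos);
            }
            types.add(classify(word.group()));
            pos = word.end();
        }
        return types;
    }

    public static void main(String[] args) {
        System.out.println(tokenTypes("L250foo"));        // [LINE_NUM, IDENTIFIER]
        System.out.println(tokenTypes("L10someKeyword")); // [LINE_NUM, KEYWORD_A]
        System.out.println(tokenTypes("Lfoo"));           // [IDENTIFIER]
    }
}
```

The point of the design is that LINE_NUM is only ever tried for the first token of a line, so something like L10 appearing later in a line is treated as an ordinary identifier.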
Ulick answered 1/4, 2014 at 18:8 Comment(6)
Both methods look good, thanks! Any considerations I should take into account that would push me towards one or the other? - Brachial
@HarrisonPaine The lexer interpreter isn't able to evaluate semantic predicates, but lexers in combined grammars can't have multiple modes. If it were me, I would use multiple modes, since I always separate my lexers and parsers anyway. - Ulick
I implemented the second approach and realized that I was actually facing a different problem that wasn't evident in the simplified version. The actual language is nested in another format, which provides a listing of all defined identifiers. Using the method here: https://mcmap.net/q/918826/-can-i-add-antlr-tokens-at-runtime I can just tokenize each identifier exactly, saving a lot of headaches. Still, thanks for answering the question as I posted it; you definitely helped get me out of a stall. - Brachial
@280Z28: Are there plans to lift the restriction that disallows multiple lexer modes in combined lexer/parser grammars? Is there any advantage (besides simplicity) to using a combined grammar? - Shanell
@Shanell I never use combined grammars, because they allow you to accidentally define new literal tokens in parser rules, quickly leading to hard-to-find bugs. - Ulick
@280Z28: Ok, but besides being able to create hard-to-find bugs, is there any advantage in using combined grammars? - Shanell
