Lexer to handle lines with line number prefix

Asked 1/4, 2014 at 14:35 Answered 1/4, 2014 at 18:8

Solved antlr antlr4

I'm writing a parser for a language that looks like the following:

L00<<identifier>>
L10<<keyword>>
L250<<identifier>>
<<identifier>>

That is, each line may or may not start with a line number of the form Lxxx ('L' followed by one or more digits), followed by an identifier or a keyword. Identifiers are the standard [a-zA-Z_][a-zA-Z0-9_]*, and the number of digits following the L is not fixed. Spaces between the line number and the following identifier/keyword are optional (and absent in most cases).

My current grammar looks like:

// Parser rules
commands      : command*;
command       : LINE_NUM? keyword NEWLINE
              | LINE_NUM? IDENTIFIER NEWLINE;
keyword       : KEYWORD_A | KEYWORD_B | ... ;

// Lexer rules
fragment INT  : [0-9]+;
LINE_NUM      : 'L' INT;
KEYWORD_A     : 'someKeyword';
KEYWORD_B     : 'reservedWord';
...
IDENTIFIER    : [a-zA-Z_][a-zA-Z0-9_]*;

However, this causes every line that begins with a line number to be tokenized as an IDENTIFIER.
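
The likely cause is ANTLR's longest-match rule: the lexer picks whichever rule consumes the most characters and only falls back to rule order on a tie. Since IDENTIFIER also accepts 'L' followed by digits and then more letters, it out-matches LINE_NUM. The following is a minimal Java sketch of that selection logic using java.util.regex rather than ANTLR itself; the class and method names are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: imitates "longest match wins, first rule wins ties".
final class LongestMatchSketch {
    // Rules are kept in grammar order so that earlier rules win ties.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("LINE_NUM", Pattern.compile("L[0-9]+"));
        RULES.put("KEYWORD_A", Pattern.compile("someKeyword"));
        RULES.put("KEYWORD_B", Pattern.compile("reservedWord"));
        RULES.put("IDENTIFIER", Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*"));
    }

    /** Returns "RULE:text" for the first token of the input, or null if no rule matches. */
    static String firstToken(String input) {
        String bestRule = null;
        int bestLength = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            // lookingAt() anchors at the start; only a strictly longer match replaces an earlier rule.
            if (m.lookingAt() && m.end() > bestLength) {
                bestRule = rule.getKey();
                bestLength = m.end();
            }
        }
        return bestRule == null ? null : bestRule + ":" + input.substring(0, bestLength);
    }

    public static void main(String[] args) {
        System.out.println(firstToken("L250counter")); // IDENTIFIER:L250counter
        System.out.println(firstToken("L250"));        // LINE_NUM:L250
        System.out.println(firstToken("someKeyword")); // KEYWORD_A:someKeyword
    }
}
```

A bare L250 still lexes as LINE_NUM because of the tie-break, which is why the problem only shows up when something follows the line number without a space.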

Is there a way to properly tokenize this input using an ANTLR grammar?

Brachial answered 1/4, 2014 at 14:35 Comment(4)

Are there spaces (or a space) between LINE_NUM and IDENTIFIER? – Greyso 1/4, 2014 at 16:7

@Bart Whitespace between LINE_NUM and IDENTIFIER is optional. I edited the question to clarify. – Brachial 1/4, 2014 at 16:11

Your sample implies (if it's valid) that an identifier may be optionally preceded by a LINE_NUM. The grammar says it's mandatory. Is that right? – Shanell 4/4, 2014 at 7:29

The implied grammar of the sample is the correct behavior. LINE_NUM is optional in both cases. I corrected the grammar. – Brachial 4/4, 2014 at 17:14

You need to add a semantic predicate to IDENTIFIER:

IDENTIFIER
  : {getCharPositionInLine() != 0
      || _input.LA(1) != 'L'
      || !Character.isDigit(_input.LA(2))}?
    [a-zA-Z_] [a-zA-Z0-9_]*
  ;
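
To make the predicate's logic easier to follow, here is a small Java sketch of the same condition as a plain method, outside of ANTLR. The class name, method name, and the (line, column) parameters are assumptions for illustration; in the grammar the equivalent information comes from the lexer's column position and lookahead.

```java
// Hypothetical helper mirroring the predicate's condition on a single line of text.
final class IdentifierPredicateSketch {
    /**
     * Returns true if an identifier may start at the given column, that is,
     * unless we are at column 0 and the text there looks like 'L' followed by a digit.
     */
    static boolean identifierAllowed(String line, int column) {
        boolean atLineStart = column == 0;
        boolean startsWithL = column < line.length() && line.charAt(column) == 'L';
        boolean digitFollows = column + 1 < line.length()
                && Character.isDigit(line.charAt(column + 1));
        // Same shape as the grammar predicate: any one of the three conditions failing allows IDENTIFIER.
        return !atLineStart || !startsWithL || !digitFollows;
    }

    public static void main(String[] args) {
        System.out.println(identifierAllowed("L00foo", 0)); // false: LINE_NUM must match here
        System.out.println(identifierAllowed("L00foo", 3)); // true: past the line number
        System.out.println(identifierAllowed("Label", 0));  // true: 'L' is not followed by a digit
    }
}
```

With IDENTIFIER ruled out at the start of such a line, LINE_NUM is the only rule left that can match there, and the identifier or keyword is lexed afterwards as usual.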

You could also avoid semantic predicates by using lexer modes.

//
// Default mode is active at the beginning of a line
//

LINE_NUM
  : 'L' [0-9]+ -> pushMode(NotBeginningOfLine)
  ;

KEYWORD_A : 'someKeyword' -> pushMode(NotBeginningOfLine);
KEYWORD_B : 'reservedWord' -> pushMode(NotBeginningOfLine);
IDENTIFIER
  : ( 'L'
    | 'L' [a-zA-Z_] [a-zA-Z0-9_]*
    | [a-zA-KM-Z_] [a-zA-Z0-9_]*
    )
    -> pushMode(NotBeginningOfLine)
  ;
NL : ('\r' '\n'? | '\n');

mode NotBeginningOfLine;

  NotBeginningOfLine_NL : ('\r' '\n'? | '\n') -> type(NL), popMode;
  NotBeginningOfLine_KEYWORD_A : KEYWORD_A -> type(KEYWORD_A);
  NotBeginningOfLine_KEYWORD_B : KEYWORD_B -> type(KEYWORD_B);
  NotBeginningOfLine_IDENTIFIER
    : [a-zA-Z_] [a-zA-Z0-9_]* -> type(IDENTIFIER)
    ;
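
The effect of the two modes can be illustrated with a hand-written Java sketch that keeps a single "at line start" flag in place of ANTLR's mode stack. It is not generated by ANTLR and makes the same simplifying assumption as the grammar above that no whitespace other than newlines appears; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical hand-rolled equivalent of the two-mode lexer, for illustration only.
final class TwoModeLexerSketch {
    private static final Pattern LINE_NUM = Pattern.compile("L[0-9]+");
    private static final Pattern WORD = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*");
    private static final Pattern NEWLINE = Pattern.compile("\\r\\n?|\\n");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        boolean atLineStart = true; // corresponds to the default mode
        int pos = 0;
        while (pos < input.length()) {
            Matcher m = NEWLINE.matcher(input).region(pos, input.length());
            if (m.lookingAt()) {
                tokens.add("NL");
                atLineStart = true; // like popMode: back to the default mode
            } else if (atLineStart
                    && (m = LINE_NUM.matcher(input).region(pos, input.length())).lookingAt()) {
                tokens.add("LINE_NUM:" + m.group());
                atLineStart = false; // like pushMode(NotBeginningOfLine)
            } else if ((m = WORD.matcher(input).region(pos, input.length())).lookingAt()) {
                tokens.add(typeOf(m.group()) + ":" + m.group());
                atLineStart = false;
            } else {
                throw new IllegalArgumentException("Unexpected character at offset " + pos);
            }
            pos = m.end(); // m is whichever matcher succeeded above
        }
        return tokens;
    }

    // A whole word equal to a keyword is a keyword; anything else is an identifier.
    private static String typeOf(String word) {
        switch (word) {
            case "someKeyword": return "KEYWORD_A";
            case "reservedWord": return "KEYWORD_B";
            default: return "IDENTIFIER";
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("L00foo\nL10someKeyword\nbar\n"));
        // [LINE_NUM:L00, IDENTIFIER:foo, NL, LINE_NUM:L10, KEYWORD_A:someKeyword, NL, IDENTIFIER:bar, NL]
    }
}
```

As in the grammar, a line number is only recognized as the first token of a line, so something like L5 appearing later on the line is treated as an ordinary identifier.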

Ulick answered 1/4, 2014 at 18:8 Comment(6)

Both methods look good, thanks! Any considerations I should take into account that would push me towards one or the other? – Brachial 1/4, 2014 at 18:57

@HarrisonPaine The lexer interpreter isn't able to evaluate semantic predicates, but lexers in combined grammars can't have multiple modes. If it were me, I would use multiple modes, since I always separate my lexers and parsers anyway. – Ulick 1/4, 2014 at 20:35

I implemented the second approach and realized that I was actually facing a different problem that wasn't evident in the simplified version. The actual language is nested in another format, which provides a listing of all defined identifiers. Using the method here: https://mcmap.net/q/918826/-can-i-add-antlr-tokens-at-runtime I can just tokenize each identifier exactly, saving a lot of headaches. Still, thanks for answering the question as I posted it; you definitely helped get me out of a stall. – Brachial 3/4, 2014 at 13:14

@280Z28: Are there plans to lift the restriction on multiple lexer modes in combined lexer/parser grammars? Is there any advantage (besides simplicity) to using a combined grammar? – Shanell 4/4, 2014 at 7:19

@Shanell I never use combined grammars, because they allow you to accidentally define new literal tokens in parser rules, quickly leading to hard-to-find bugs. – Ulick 4/4, 2014 at 13:25

@280Z28: Ok, but besides being able to create hard-to-find bugs, is there any advantage in using combined grammars? – Shanell 4/4, 2014 at 13:27
