Using ANTLR Parser and Lexer Separately

I am using ANTLR version 4 to create a compiler. The first phase was the lexer. I created a "CompilerLexer.g4" file and put the lexer rules in it. It works fine.

CompilerLexer.g4:


lexer grammar CompilerLexer;

INT         :   'int'   ;   //1
FLOAT       :   'float' ;   //2
BEGIN       :   'begin' ;   //3
END         :   'end'   ;   //4
To          :   'to'    ;   //5
NEXT        :   'next'  ;   //6
REAL        :   'real'  ;   //7
BOOLEAN     :   'bool'  ;   //8
.
.
.
NOTEQUAL    :   '!='    ;   //46
AND         :   '&&'    ;   //47
OR          :   '||'    ;   //48
POW         :   '^'     ;   //49
ID          : [a-zA-Z]+ ;   //50




WS
:   ' ' -> channel(HIDDEN)  //50
;

Now it is time for phase 2, which is the parser. I created a "CompilerParser.g4" file and put the grammar rules in it, but I get dozens of warnings and errors.

CompilerParser.g4:


parser grammar CompilerParser;

options {   tokenVocab = CompilerLexer; }

STATEMENT   :   EXPRESSION SEMIC
        |   IFSTMT
        |   WHILESTMT
        |   FORSTMT
        |   READSTMT SEMIC
        |   WRITESTMT SEMIC
        |   VARDEF SEMIC
        |   BLOCK
        ;

BLOCK       : BEGIN STATEMENTS END
        ;

STATEMENTS  : STATEMENT STATEMENTS*
        ;

EXPRESSION  : ID ASSIGN EXPRESSION
        | BOOLEXP
        ;

RELEXP      : MODEXP (GT | LT | EQUAL | NOTEQUAL | LE | GE | AND | OR) RELEXP
        | MODEXP
        ;

.
.
.

VARDEF      : (ID COMA)* ID COLON VARTYPE
        ;

VARTYPE     : INT
        | FLOAT
        | CHAR
        | STRING
        ;
compileUnit
:   EOF
;

Warnings and errors:

  • implicit definition of token 'BLOCK' in parser
  • implicit definition of token 'BOOLEXP' in parser
  • implicit definition of token 'EXP' in parser
  • implicit definition of token 'EXPLIST' in parser
  • lexer rule 'BLOCK' not allowed in parser
  • lexer rule 'EXP' not allowed in parser
  • lexer rule 'EXPLIST' not allowed in parser
  • lexer rule 'EXPRESSION' not allowed in parser

I get dozens of these warnings and errors. What is the cause?

General questions: What is the difference between using a combined grammar and using a separate lexer and parser? How should I join separate grammar and lexer files?

Et answered 19/6, 2014 at 5:15 Comment(0)

Lexer rules start with a capital letter, and parser rules start with a lowercase letter. In a parser grammar, you can't define tokens. Since ANTLR treats all your upper-cased rules as lexer rules, it produces these errors and warnings.
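
As a minimal sketch of the fix, a few of the rules from the question could be renamed as shown here. It assumes that the elided part of CompilerLexer.g4 defines the tokens SEMIC, COMA, COLON, CHAR and STRING (the parser grammar in the question uses them, but their definitions are not shown). The `compileUnit : statements EOF` entry rule and the `statement+` form are illustrative choices, not taken from the question:

```
parser grammar CompilerParser;

options { tokenVocab = CompilerLexer; }

// Entry rule: assumed here to match a list of statements followed by EOF.
compileUnit : statements EOF ;

// Parser rule names start with a lowercase letter.
statements : statement+ ;

// Only two alternatives are shown; the others would be renamed the same way.
statement
    : varDef SEMIC
    | block
    ;

// BEGIN and END are tokens from the lexer grammar, so they stay uppercase.
block : BEGIN statements END ;

varDef : (ID COMA)* ID COLON varType ;

varType : INT | FLOAT | CHAR | STRING ;
```

The remaining rules (expression, relExp, and so on) would follow the same pattern: lowercase names for parser rules, and uppercase names only for tokens that come from the lexer grammar.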

EDIT

user2998131 wrote:

General questions: What is the difference between using a combined grammar and using a separate lexer and parser?

Separating the lexer and parser rules keeps things organized. Also, when creating separate lexer and parser grammars, you can't (accidentally) put literal tokens inside your parser grammar: you need to define all tokens in your lexer grammar. This makes it apparent which lexer rules get matched before others, and it rules out typos in recurring literal tokens, which are easy to make in a combined grammar:

grammar P;

r1 : 'foo' r2;

r2 : r3 'foo '; // added an accidental space after 'foo'

But when you have a parser grammar, you can't make that mistake. You will have to use the lexer rule that matches 'foo':

parser grammar P;

options { tokenVocab=L; }

r1 : FOO r2;

r2 : r3 FOO;


lexer grammar L;

FOO : 'foo';

user2998131 wrote:

How should I join separate grammar and lexer files?

Just like you do in your parser grammar: you point to the proper tokenVocab inside the options { ... } block.

Note that you can also import grammars, which is something different: https://github.com/antlr/antlr4/blob/master/doc/grammars.md#grammar-imports

Ribbentrop answered 19/6, 2014 at 6:26 Comment(12)
@user2998131, ah, missed those. Will answer those at a later time. - Ribbentrop
If I could go a little bit further, writing a combined grammar means the language is pushing you to write context-sensitive lexer rules. These are antithetical to the way most lexers, including ANTLR's lexer, work. In my case (which is likely a common one), by using a combined grammar I was adding keywords in a number of places, which removed those strings from the set matched by my general ID lexer rule. With split lexer/parser grammar files, this becomes really obvious, since you now must declare a lexer entry for each keyword, and that re-emphasizes the lack of context the lexer must operate under. - Ananna
The link is dead. - Rectus
@Rectus changed the link. - Ribbentrop
This might be a little irrelevant, but if tokenVocab is used to combine separated lexer and parser rules, when is import supposed to be used? - Hillis
@wlnirvana, no, tokenVocab is not used to combine grammars, it is used to point a parser grammar to its token definitions (lexer grammar). - Ribbentrop
This might be just a matter of terminology, but what is "combined" at all? I thought it simply means a style where lexer and parser rules are written in the same file. But your comment suggests that achieving the same effect is not necessarily "to combine", but could be a result of "to point a parser grammar to its token definitions". Is there any difference between these two approaches? - Hillis
Yes, a "combined" grammar has both lexer rules and parser rules in 1 grammar file. Using tokenVocab inside a parser grammar (which you must do) will let you point your parser grammar to the lexer rules the parser grammar needs. Importing grammars is something a combined, parser, or lexer grammar can do in addition to all that. - Ribbentrop
@BartKiers why does practically every BNF for a language that I look up have a combined grammar and not separate rules, when the lexer and parser are separate in the actual compilers? gist.github.com/arslancharyev31/… cs.wmich.edu/~gupta/teaching/cs4850/sumII06/… void and char etc. here are input characters and not tokens, surely. This seems to be the same thing with Python: docs.python.org/3/reference/grammar.html - Antiquity
BNF often functions as documentation, in which case it makes sense to keep things compact. - Ribbentrop
When you have plenty of [WARNING] warning(125): EventsParser.g4:7:20: implicit definition of token SEMICOLON in parser like I did, just add the options {tokenVocab= EventsLexer.g4; }. It finally saved my day after too many hours searching for a way to link a grammar with a lexer and not understanding where all the warnings came from. - Leesen
@Leesen note that it is tokenVocab=EventsLexer;, not tokenVocab=EventsLexer.g4; - Ribbentrop
