Different lexer rules in different state
Asked Answered
H

1

5

I've been working on a parser for some template language embeded in HTML (FreeMarker), piece of example here:

${abc}
<html> 
<head> 
  <title>Welcome!</title> 
</head> 
<body> 
  <h1> 
    Welcome ${user}<#if user == "Big Joe">, our beloved 
leader</#if>! 
  </h1> 
  <p>Our latest product: 
  <a href="${latestProduct}">${latestProduct}</a>! 
</body> 
</html>

The template language is between some specific tags, e.g. '${' '}', '<#' '>'. Other raw texts in between can be treated like as the same tokens (RAW).

The key point here is that the same text, e.g. an integer, will mean differently thing for the parser depends on whether it's between those tags or not, and thus needs to be treated as different tokens.

I've tried with the following ugly implementation, with a self-defined state to indicate whether it's in those tags. As you see, I have to check the state almost in every rule, which drives me crazy...

I also thinked about the following two solutions:

  1. Use multiple lexers. I can switch between two lexers when inside/outside those tags. However, the document for this is poor for ANTLR3. I don't know how to let one parser share two different lexers and switch between them.

  2. Move the RAW rule up, after the NUMERICAL_ESCAPE rule. Check the state there, if it's in the tag, put back the token and continue trying the left rules. This would save lots of state checking. However, I don't find any 'put back' function and ANTLR complains about some rules can never be matched...

Is there an elegant solution for this?

grammar freemarker_simple;

@lexer::members {
int freemarker_type = 0;
}

expression
    :   primary_expression ;

primary_expression
    :   number_literal | identifier | parenthesis | builtin_variable
    ;

parenthesis
    :   OPEN_PAREN expression CLOSE_PAREN ;

number_literal
    :   INTEGER | DECIMAL
    ;

identifier
    :   ID
    ;

builtin_variable
    :   DOT ID
    ;

string_output
    :   OUTPUT_ESCAPE expression CLOSE_BRACE
    ;

numerical_output
    :   NUMERICAL_ESCAPE expression  CLOSE_BRACE
    ;

if_expression
    :   START_TAG IF expression DIRECTIVE_END optional_block
        ( START_TAG ELSE_IF expression loose_directive_end optional_block )*
        ( END_TAG ELSE optional_block )?
        END_TAG END_IF
    ;

list    :   START_TAG LIST expression AS ID DIRECTIVE_END optional_block END_TAG END_LIST ;

for_each
    :   START_TAG FOREACH ID IN expression DIRECTIVE_END optional_block END_TAG END_FOREACH ;

loose_directive_end
    :   ( DIRECTIVE_END | EMPTY_DIRECTIVE_END ) ;

freemarker_directive
    :   ( if_expression | list | for_each  ) ;
content :   ( RAW |  string_output | numerical_output | freemarker_directive ) + ;
optional_block
    :   ( content )? ;

root    :   optional_block EOF  ;

START_TAG
    :   '<#'
        { freemarker_type = 1; }
    ;

END_TAG :   '</#'
        { freemarker_type = 1; }
    ;

DIRECTIVE_END
    :   '>'
        {
        if(freemarker_type == 0) $type=RAW;
        freemarker_type = 0;
        }
    ;
EMPTY_DIRECTIVE_END
    :   '/>'
        {
        if(freemarker_type == 0) $type=RAW;
        freemarker_type = 0;
        }
    ;

OUTPUT_ESCAPE
    :   '${'
        { if(freemarker_type == 0) freemarker_type = 2; }
    ;
NUMERICAL_ESCAPE
    :   '#{'
        { if(freemarker_type == 0) freemarker_type = 2; }
    ;

IF  :   'if'
        { if(freemarker_type == 0) $type=RAW; }
    ;
ELSE    :   'else' DIRECTIVE_END
        { if(freemarker_type == 0) $type=RAW; }
    ; 
ELSE_IF :   'elseif'
        { if(freemarker_type == 0) $type=RAW; }
    ; 
LIST    :   'list'
        { if(freemarker_type == 0) $type=RAW; }
    ; 
FOREACH :   'foreach'
        { if(freemarker_type == 0) $type=RAW; }
    ; 
END_IF  :   'if' DIRECTIVE_END
        { if(freemarker_type == 0) $type=RAW; }
    ; 
END_LIST
    :   'list' DIRECTIVE_END
        { if(freemarker_type == 0) $type=RAW; }
    ; 
END_FOREACH
    :   'foreach' DIRECTIVE_END
        { if(freemarker_type == 0) $type=RAW; }
    ;


FALSE: 'false' { if(freemarker_type == 0) $type=RAW; };
TRUE: 'true' { if(freemarker_type == 0) $type=RAW; };
INTEGER: ('0'..'9')+ { if(freemarker_type == 0) $type=RAW; };
DECIMAL: INTEGER '.' INTEGER { if(freemarker_type == 0) $type=RAW; };
DOT: '.' { if(freemarker_type == 0) $type=RAW; };
DOT_DOT: '..' { if(freemarker_type == 0) $type=RAW; };
PLUS: '+' { if(freemarker_type == 0) $type=RAW; };
MINUS: '-' { if(freemarker_type == 0) $type=RAW; };
TIMES: '*' { if(freemarker_type == 0) $type=RAW; };
DIVIDE: '/' { if(freemarker_type == 0) $type=RAW; };
PERCENT: '%' { if(freemarker_type == 0) $type=RAW; };
AND: '&' | '&&' { if(freemarker_type == 0) $type=RAW; };
OR: '|' | '||' { if(freemarker_type == 0) $type=RAW; };
EXCLAM: '!' { if(freemarker_type == 0) $type=RAW; };
OPEN_PAREN: '(' { if(freemarker_type == 0) $type=RAW; };
CLOSE_PAREN: ')' { if(freemarker_type == 0) $type=RAW; };
OPEN_BRACE
    :   '{'
    { if(freemarker_type == 0) $type=RAW; }
    ;
CLOSE_BRACE
    :   '}'
    {
        if(freemarker_type == 0) $type=RAW;
        if(freemarker_type == 2) freemarker_type = 0;
    }
    ;
IN: 'in' { if(freemarker_type == 0) $type=RAW; };
AS: 'as' { if(freemarker_type == 0) $type=RAW; };
ID  :   ('A'..'Z'|'a'..'z')+
    //{ if(freemarker_type == 0) $type=RAW; }
    ;

BLANK   :   ( '\r' | ' ' | '\n' | '\t' )+
    {
        if(freemarker_type == 0) $type=RAW;
        else $channel = HIDDEN;
    }
    ;

RAW
    :   .
    ;

EDIT

I found the problem similar to How do I lex this input? , where a "start condition" is needed. But unfortunately, the answer uses a lot of predicates as well, just like my states.

Now, I tried to move the RAW higher with a predicate. Hoping to eliminate all the state checks after RAW rule. However, my example input failed, the first line end is recogonized as BLANK instead of RAW it should be.

I guess something wrong is about the rule priority: After CLOSE_BRACE is matched, the next token is matched from rules after the CLOSE_BRACE rule, rather than start from the begenning again.

Any way to resolve this?

New grammar below with some debug outputs:

grammar freemarker_simple;

@lexer::members {
int freemarker_type = 0;
}

expression
    :   primary_expression ;

primary_expression
    :   number_literal | identifier | parenthesis | builtin_variable
    ;

parenthesis
    :   OPEN_PAREN expression CLOSE_PAREN ;

number_literal
    :   INTEGER | DECIMAL
    ;

identifier
    :   ID
    ;

builtin_variable
    :   DOT ID
    ;

string_output
    :   OUTPUT_ESCAPE expression CLOSE_BRACE
    ;

numerical_output
    :   NUMERICAL_ESCAPE expression  CLOSE_BRACE
    ;

if_expression
    :   START_TAG IF expression DIRECTIVE_END optional_block
        ( START_TAG ELSE_IF expression loose_directive_end optional_block )*
        ( END_TAG ELSE optional_block )?
        END_TAG END_IF
    ;

list    :   START_TAG LIST expression AS ID DIRECTIVE_END optional_block END_TAG END_LIST ;

for_each
    :   START_TAG FOREACH ID IN expression DIRECTIVE_END optional_block END_TAG END_FOREACH ;

loose_directive_end
    :   ( DIRECTIVE_END | EMPTY_DIRECTIVE_END ) ;

freemarker_directive
    :   ( if_expression | list | for_each  ) ;
content :   ( RAW |  string_output | numerical_output | freemarker_directive ) + ;
optional_block
    :   ( content )? ;

root    :   optional_block EOF  ;

START_TAG
    :   '<#'
        { freemarker_type = 1; }
    ;

END_TAG :   '</#'
        { freemarker_type = 1; }
    ;

OUTPUT_ESCAPE
    :   '${'
        { if(freemarker_type == 0) freemarker_type = 2; }
    ;
NUMERICAL_ESCAPE
    :   '#{'
        { if(freemarker_type == 0) freemarker_type = 2; }
    ;
RAW
    :
        { freemarker_type == 0 }?=> .
        {System.out.printf("RAW \%s \%d\n",getText(),freemarker_type);}
    ;

DIRECTIVE_END
    :   '>'
        { if(freemarker_type == 1) freemarker_type = 0; }
    ;
EMPTY_DIRECTIVE_END
    :   '/>'
        { if(freemarker_type == 1) freemarker_type = 0; }
    ;

IF  :   'if'

    ;
ELSE    :   'else' DIRECTIVE_END

    ; 
ELSE_IF :   'elseif'

    ; 
LIST    :   'list'

    ; 
FOREACH :   'foreach'

    ; 
END_IF  :   'if' DIRECTIVE_END
    ; 
END_LIST
    :   'list' DIRECTIVE_END
    ; 
END_FOREACH
    :   'foreach' DIRECTIVE_END
    ;


FALSE: 'false' ;
TRUE: 'true' ;
INTEGER: ('0'..'9')+ ;
DECIMAL: INTEGER '.' INTEGER ;
DOT: '.' ;
DOT_DOT: '..' ;
PLUS: '+' ;
MINUS: '-' ;
TIMES: '*' ;
DIVIDE: '/' ;
PERCENT: '%' ;
AND: '&' | '&&' ;
OR: '|' | '||' ;
EXCLAM: '!' ;
OPEN_PAREN: '(' ;
CLOSE_PAREN: ')' ;
OPEN_BRACE
    :   '{'
    ;
CLOSE_BRACE
    :   '}'
    { if(freemarker_type == 2) {freemarker_type = 0;} }
    ;
IN: 'in' ;
AS: 'as' ;
ID  :   ('A'..'Z'|'a'..'z')+
    { System.out.printf("ID \%s \%d\n",getText(),freemarker_type);}
    ;

BLANK   :   ( '\r' | ' ' | '\n' | '\t' )+
    {
        System.out.printf("BLANK \%d\n",freemarker_type);
        $channel = HIDDEN;
    }
    ;

My input results with the output:

ID abc 2
BLANK 0  <<< incorrect, should be RAW when state==0
RAW < 0  <<< correct
ID html 0 <<< incorrect, should be RAW RAW RAW RAW
RAW > 0

EDIT2

Also tried the 2nd approach with Bart's grammar, still didn't work the 'html' is recognized as an ID, which should be 4 RAWs. When mmode=false, shouldn't RAW get matched first? Or the lexer still chooses the longest match here?

grammar freemarker_bart;

options {
  output=AST;
  ASTLabelType=CommonTree;
}

tokens {
  FILE;
  OUTPUT;
  RAW_BLOCK;
}

@parser::members {

  // merge a given list of tokens into a single AST
  private CommonTree merge(List tokenList) {
    StringBuilder b = new StringBuilder();
    for(int i = 0; i < tokenList.size(); i++) {
      Token token = (Token)tokenList.get(i);
      b.append(token.getText());
    }
    return new CommonTree(new CommonToken(RAW, b.toString()));
  }
}

@lexer::members {
  private boolean mmode = false;
}

parse
  :  content* EOF -> ^(FILE content*)
  ;

content
  :  (options {greedy=true;}: t+=RAW)+ -> ^(RAW_BLOCK {merge($t)})
  |  if_stat
  |  output
  ;

if_stat
  :  TAG_START IF expression TAG_END raw_block TAG_END_START IF TAG_END -> ^(IF expression raw_block)
  ;

output
  :  OUTPUT_START expression OUTPUT_END -> ^(OUTPUT expression)
  ;

raw_block
  :  (t+=RAW)* -> ^(RAW_BLOCK {merge($t)})
  ;

expression
  :  eq_expression
  ;

eq_expression
  :  atom (EQUALS^ atom)* 
  ;

atom
  :  STRING
  |  ID
  ;

// these tokens denote the start of markup code (sets mmode to true)
OUTPUT_START  : '${'  {mmode=true;};
TAG_START     : '<#'  {mmode=true;};
TAG_END_START : '</' ('#' {mmode=true;} | ~'#' {$type=RAW;});

RAW           : {!mmode}?=> . ;

// these tokens denote the end of markup code (sets mmode to false)
OUTPUT_END    : '}' {mmode=false;};
TAG_END       : '>' {mmode=false;};

// valid tokens only when in "markup mode"
EQUALS        : '==';
IF            : 'if';
STRING        : '"' ~'"'* '"';
ID            : ('a'..'z' | 'A'..'Z')+;
SPACE         : (' ' | '\t' | '\r' | '\n')+ {skip();};
Homestretch answered 22/10, 2011 at 17:53 Comment(0)
L
2

You could let lexer rules match using gated semantic predicates where you test for a certain boolean expression.

A little demo:

freemarker_simple.g

grammar freemarker_simple;

options {
  output=AST;
  ASTLabelType=CommonTree;
}

tokens {
  FILE;
  OUTPUT;
  RAW_BLOCK;
}

@parser::members {

  // merge a given list of tokens into a single AST
  private CommonTree merge(List tokenList) {
    StringBuilder b = new StringBuilder();
    for(int i = 0; i < tokenList.size(); i++) {
      Token token = (Token)tokenList.get(i);
      b.append(token.getText());
    }
    return new CommonTree(new CommonToken(RAW, b.toString()));
  }
}

@lexer::members {
  private boolean mmode = false;
}

parse
  :  content* EOF -> ^(FILE content*)
  ;

content
  :  (options {greedy=true;}: t+=RAW)+ -> ^(RAW_BLOCK {merge($t)})
  |  if_stat
  |  output
  ;

if_stat
  :  TAG_START IF expression TAG_END raw_block TAG_END_START IF TAG_END -> ^(IF expression raw_block)
  ;

output
  :  OUTPUT_START expression OUTPUT_END -> ^(OUTPUT expression)
  ;

raw_block
  :  (t+=RAW)* -> ^(RAW_BLOCK {merge($t)})
  ;

expression
  :  eq_expression
  ;

eq_expression
  :  atom (EQUALS^ atom)* 
  ;

atom
  :  STRING
  |  ID
  ;

// these tokens denote the start of markup code (sets mmode to true)
OUTPUT_START  : '${'  {mmode=true;};
TAG_START     : '<#'  {mmode=true;};
TAG_END_START : '</' ('#' {mmode=true;} | ~'#' {$type=RAW;});

// these tokens denote the end of markup code (sets mmode to false)
OUTPUT_END    : {mmode}?=> '}' {mmode=false;};
TAG_END       : {mmode}?=> '>' {mmode=false;};

// valid tokens only when in "markup mode"
EQUALS        : {mmode}?=> '==';
IF            : {mmode}?=> 'if';
STRING        : {mmode}?=> '"' ~'"'* '"';
ID            : {mmode}?=> ('a'..'z' | 'A'..'Z')+;
SPACE         : {mmode}?=> (' ' | '\t' | '\r' | '\n')+ {skip();};

RAW           : . ;

which parses your input:

test.html

${abc}
<html> 
<head> 
  <title>Welcome!</title> 
</head> 
<body> 
  <h1> 
    Welcome ${user}<#if user == "Big Joe">, our beloved leader</#if>! 
  </h1> 
  <p>Our latest product: <a href="${latestProduct}">${latestProduct}</a>!</p>
</body> 
</html>

into the following AST:

enter image description here

as you can test yourself with the class:

Main.java

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
  public static void main(String[] args) throws Exception {
    freemarker_simpleLexer lexer = new freemarker_simpleLexer(new ANTLRFileStream("test.html"));
    freemarker_simpleParser parser = new freemarker_simpleParser(new CommonTokenStream(lexer));
    CommonTree tree = (CommonTree)parser.parse().getTree();
    DOTTreeGenerator gen = new DOTTreeGenerator();
    StringTemplate st = gen.toDOT(tree);
    System.out.println(st);
  }
}

EDIT 1

When I run your example input with a parser generated from the second grammar you posted, the following are wthe first 5 lines being printed to the console (not counting the many warnings that are generated):

ID abc 2
RAW 
 0
RAW < 0
ID html 0
...

EDIT 2

Bood wrote:

Also tried the 2nd approach with Bart's grammar, still didn't work the 'html' is recognized as an ID, which should be 4 RAWs. When mmode=false, shouldn't RAW get matched first? Or the lexer still chooses the longest match here?

Yes, that is correct: ANTLR chooses the longer match in that case.

But now that I (finally :)) see what you're trying to do, here's a last proposal: you could let the RAW rule match characters as long as the rule can't see one of the following character sequences ahead: "<#", "</#" or "${". Note that the rule must still stay at the end in the grammar. This check is performed inside the lexer. Also, in that case you don't need the merge(...) method in the parser:

grammar freemarker_simple;

options {
  output=AST;
  ASTLabelType=CommonTree;
}

tokens {
  FILE;
  OUTPUT;
  RAW_BLOCK;
}

@lexer::members {
  
  private boolean mmode = false;
  
  private boolean rawAhead() {
    if(mmode) return false;
    int ch1 = input.LA(1), ch2 = input.LA(2), ch3 = input.LA(3);
    return !(
        (ch1 == '<' && ch2 == '#') ||
        (ch1 == '<' && ch2 == '/' && ch3 == '#') ||
        (ch1 == '$' && ch2 == '{')
    );
  }
}

parse
  :  content* EOF -> ^(FILE content*)
  ;

content
  :  RAW
  |  if_stat
  |  output
  ;

if_stat
  :  TAG_START IF expression TAG_END RAW TAG_END_START IF TAG_END -> ^(IF expression RAW)
  ;

output
  :  OUTPUT_START expression OUTPUT_END -> ^(OUTPUT expression)
  ;

expression
  :  eq_expression
  ;

eq_expression
  :  atom (EQUALS^ atom)*
  ;

atom
  :  STRING
  |  ID
  ;

OUTPUT_START  : '${'  {mmode=true;};
TAG_START     : '<#'  {mmode=true;};
TAG_END_START : '</' ('#' {mmode=true;} | ~'#' {$type=RAW;});

OUTPUT_END    : '}' {mmode=false;};
TAG_END       : '>' {mmode=false;};

EQUALS        : '==';
IF            : 'if';
STRING        : '"' ~'"'* '"';
ID            : ('a'..'z' | 'A'..'Z')+;
SPACE         : (' ' | '\t' | '\r' | '\n')+ {skip();};

RAW           : ({rawAhead()}?=> . )+;

The grammar above will produce the following AST from the input posted at the start of this answer:

enter image description here

Loralyn answered 23/10, 2011 at 11:35 Comment(7)
Well, in this case, I have to use a predicate for all the possible tokens in the "markup mode" right? I think it's essentially same as my "freemarker_type" state as I stated in the question. But still appreciate much since you provided a vivid version, with AST generation and RAW token merge, which is all I have to find out later anyway, so very helpful, thanks :)Homestretch
Wondering is there any more elegant way to do this? Do you know why my second approach failed?Homestretch
Yes, it is essentially the same as you mentioned, but I posted a working version since you said yours wasn't working (or there were errors/warnings?). And yes, you'd have to put a predicate in front of the tokens belonging to your "inner language". And no, I don't think you're going to get anything more elegant since you're essentially cramming two languages in a single grammar. And you're welcome, of course.Loralyn
No, I glanced over your second approach, but didn't immediately get it, sorry.Loralyn
What I'm doing in the 2nd approach is to increase the priority of the RAW rule, hoping it get matched before most other tokens when not in mmode. Thus I can remove those predicates for the "most other tokens" since ANTLR will get to them only when the predicate of the RAW rule fails, i.e. in the mmode.Homestretch
@Bood, ah okay, I see what mean (by your latest comment, not your grammar... :)). You could try to do it like that based on my example, which works, and does not produce any warnings (your grammar produces quite some warnings, so it's not really trustworthy. Also see my EDIT).Loralyn
I think, the StringTemplate using a custom inside-outside mode lexer is an elegant way.Legend

© 2022 - 2024 — McMap. All rights reserved.