I've been working on a parser for some template language embeded in HTML (FreeMarker), piece of example here:
${abc}
<html>
<head>
<title>Welcome!</title>
</head>
<body>
<h1>
Welcome ${user}<#if user == "Big Joe">, our beloved
leader</#if>!
</h1>
<p>Our latest product:
<a href="${latestProduct}">${latestProduct}</a>!
</body>
</html>
The template language is between some specific tags, e.g. '${' '}', '<#' '>'. Other raw texts in between can be treated like as the same tokens (RAW).
The key point here is that the same text, e.g. an integer, will mean differently thing for the parser depends on whether it's between those tags or not, and thus needs to be treated as different tokens.
I've tried with the following ugly implementation, with a self-defined state to indicate whether it's in those tags. As you see, I have to check the state almost in every rule, which drives me crazy...
I also thinked about the following two solutions:
Use multiple lexers. I can switch between two lexers when inside/outside those tags. However, the document for this is poor for ANTLR3. I don't know how to let one parser share two different lexers and switch between them.
Move the RAW rule up, after the NUMERICAL_ESCAPE rule. Check the state there, if it's in the tag, put back the token and continue trying the left rules. This would save lots of state checking. However, I don't find any 'put back' function and ANTLR complains about some rules can never be matched...
Is there an elegant solution for this?
grammar freemarker_simple;
@lexer::members {
int freemarker_type = 0;
}
expression
: primary_expression ;
primary_expression
: number_literal | identifier | parenthesis | builtin_variable
;
parenthesis
: OPEN_PAREN expression CLOSE_PAREN ;
number_literal
: INTEGER | DECIMAL
;
identifier
: ID
;
builtin_variable
: DOT ID
;
string_output
: OUTPUT_ESCAPE expression CLOSE_BRACE
;
numerical_output
: NUMERICAL_ESCAPE expression CLOSE_BRACE
;
if_expression
: START_TAG IF expression DIRECTIVE_END optional_block
( START_TAG ELSE_IF expression loose_directive_end optional_block )*
( END_TAG ELSE optional_block )?
END_TAG END_IF
;
list : START_TAG LIST expression AS ID DIRECTIVE_END optional_block END_TAG END_LIST ;
for_each
: START_TAG FOREACH ID IN expression DIRECTIVE_END optional_block END_TAG END_FOREACH ;
loose_directive_end
: ( DIRECTIVE_END | EMPTY_DIRECTIVE_END ) ;
freemarker_directive
: ( if_expression | list | for_each ) ;
content : ( RAW | string_output | numerical_output | freemarker_directive ) + ;
optional_block
: ( content )? ;
root : optional_block EOF ;
START_TAG
: '<#'
{ freemarker_type = 1; }
;
END_TAG : '</#'
{ freemarker_type = 1; }
;
DIRECTIVE_END
: '>'
{
if(freemarker_type == 0) $type=RAW;
freemarker_type = 0;
}
;
EMPTY_DIRECTIVE_END
: '/>'
{
if(freemarker_type == 0) $type=RAW;
freemarker_type = 0;
}
;
OUTPUT_ESCAPE
: '${'
{ if(freemarker_type == 0) freemarker_type = 2; }
;
NUMERICAL_ESCAPE
: '#{'
{ if(freemarker_type == 0) freemarker_type = 2; }
;
IF : 'if'
{ if(freemarker_type == 0) $type=RAW; }
;
ELSE : 'else' DIRECTIVE_END
{ if(freemarker_type == 0) $type=RAW; }
;
ELSE_IF : 'elseif'
{ if(freemarker_type == 0) $type=RAW; }
;
LIST : 'list'
{ if(freemarker_type == 0) $type=RAW; }
;
FOREACH : 'foreach'
{ if(freemarker_type == 0) $type=RAW; }
;
END_IF : 'if' DIRECTIVE_END
{ if(freemarker_type == 0) $type=RAW; }
;
END_LIST
: 'list' DIRECTIVE_END
{ if(freemarker_type == 0) $type=RAW; }
;
END_FOREACH
: 'foreach' DIRECTIVE_END
{ if(freemarker_type == 0) $type=RAW; }
;
FALSE: 'false' { if(freemarker_type == 0) $type=RAW; };
TRUE: 'true' { if(freemarker_type == 0) $type=RAW; };
INTEGER: ('0'..'9')+ { if(freemarker_type == 0) $type=RAW; };
DECIMAL: INTEGER '.' INTEGER { if(freemarker_type == 0) $type=RAW; };
DOT: '.' { if(freemarker_type == 0) $type=RAW; };
DOT_DOT: '..' { if(freemarker_type == 0) $type=RAW; };
PLUS: '+' { if(freemarker_type == 0) $type=RAW; };
MINUS: '-' { if(freemarker_type == 0) $type=RAW; };
TIMES: '*' { if(freemarker_type == 0) $type=RAW; };
DIVIDE: '/' { if(freemarker_type == 0) $type=RAW; };
PERCENT: '%' { if(freemarker_type == 0) $type=RAW; };
AND: '&' | '&&' { if(freemarker_type == 0) $type=RAW; };
OR: '|' | '||' { if(freemarker_type == 0) $type=RAW; };
EXCLAM: '!' { if(freemarker_type == 0) $type=RAW; };
OPEN_PAREN: '(' { if(freemarker_type == 0) $type=RAW; };
CLOSE_PAREN: ')' { if(freemarker_type == 0) $type=RAW; };
OPEN_BRACE
: '{'
{ if(freemarker_type == 0) $type=RAW; }
;
CLOSE_BRACE
: '}'
{
if(freemarker_type == 0) $type=RAW;
if(freemarker_type == 2) freemarker_type = 0;
}
;
IN: 'in' { if(freemarker_type == 0) $type=RAW; };
AS: 'as' { if(freemarker_type == 0) $type=RAW; };
ID : ('A'..'Z'|'a'..'z')+
//{ if(freemarker_type == 0) $type=RAW; }
;
BLANK : ( '\r' | ' ' | '\n' | '\t' )+
{
if(freemarker_type == 0) $type=RAW;
else $channel = HIDDEN;
}
;
RAW
: .
;
EDIT
I found the problem similar to How do I lex this input? , where a "start condition" is needed. But unfortunately, the answer uses a lot of predicates as well, just like my states.
Now, I tried to move the RAW higher with a predicate. Hoping to eliminate all the state checks after RAW rule. However, my example input failed, the first line end is recogonized as BLANK instead of RAW it should be.
I guess something wrong is about the rule priority: After CLOSE_BRACE is matched, the next token is matched from rules after the CLOSE_BRACE rule, rather than start from the begenning again.
Any way to resolve this?
New grammar below with some debug outputs:
grammar freemarker_simple;
@lexer::members {
int freemarker_type = 0;
}
expression
: primary_expression ;
primary_expression
: number_literal | identifier | parenthesis | builtin_variable
;
parenthesis
: OPEN_PAREN expression CLOSE_PAREN ;
number_literal
: INTEGER | DECIMAL
;
identifier
: ID
;
builtin_variable
: DOT ID
;
string_output
: OUTPUT_ESCAPE expression CLOSE_BRACE
;
numerical_output
: NUMERICAL_ESCAPE expression CLOSE_BRACE
;
if_expression
: START_TAG IF expression DIRECTIVE_END optional_block
( START_TAG ELSE_IF expression loose_directive_end optional_block )*
( END_TAG ELSE optional_block )?
END_TAG END_IF
;
list : START_TAG LIST expression AS ID DIRECTIVE_END optional_block END_TAG END_LIST ;
for_each
: START_TAG FOREACH ID IN expression DIRECTIVE_END optional_block END_TAG END_FOREACH ;
loose_directive_end
: ( DIRECTIVE_END | EMPTY_DIRECTIVE_END ) ;
freemarker_directive
: ( if_expression | list | for_each ) ;
content : ( RAW | string_output | numerical_output | freemarker_directive ) + ;
optional_block
: ( content )? ;
root : optional_block EOF ;
START_TAG
: '<#'
{ freemarker_type = 1; }
;
END_TAG : '</#'
{ freemarker_type = 1; }
;
OUTPUT_ESCAPE
: '${'
{ if(freemarker_type == 0) freemarker_type = 2; }
;
NUMERICAL_ESCAPE
: '#{'
{ if(freemarker_type == 0) freemarker_type = 2; }
;
RAW
:
{ freemarker_type == 0 }?=> .
{System.out.printf("RAW \%s \%d\n",getText(),freemarker_type);}
;
DIRECTIVE_END
: '>'
{ if(freemarker_type == 1) freemarker_type = 0; }
;
EMPTY_DIRECTIVE_END
: '/>'
{ if(freemarker_type == 1) freemarker_type = 0; }
;
IF : 'if'
;
ELSE : 'else' DIRECTIVE_END
;
ELSE_IF : 'elseif'
;
LIST : 'list'
;
FOREACH : 'foreach'
;
END_IF : 'if' DIRECTIVE_END
;
END_LIST
: 'list' DIRECTIVE_END
;
END_FOREACH
: 'foreach' DIRECTIVE_END
;
FALSE: 'false' ;
TRUE: 'true' ;
INTEGER: ('0'..'9')+ ;
DECIMAL: INTEGER '.' INTEGER ;
DOT: '.' ;
DOT_DOT: '..' ;
PLUS: '+' ;
MINUS: '-' ;
TIMES: '*' ;
DIVIDE: '/' ;
PERCENT: '%' ;
AND: '&' | '&&' ;
OR: '|' | '||' ;
EXCLAM: '!' ;
OPEN_PAREN: '(' ;
CLOSE_PAREN: ')' ;
OPEN_BRACE
: '{'
;
CLOSE_BRACE
: '}'
{ if(freemarker_type == 2) {freemarker_type = 0;} }
;
IN: 'in' ;
AS: 'as' ;
ID : ('A'..'Z'|'a'..'z')+
{ System.out.printf("ID \%s \%d\n",getText(),freemarker_type);}
;
BLANK : ( '\r' | ' ' | '\n' | '\t' )+
{
System.out.printf("BLANK \%d\n",freemarker_type);
$channel = HIDDEN;
}
;
My input results with the output:
ID abc 2
BLANK 0 <<< incorrect, should be RAW when state==0
RAW < 0 <<< correct
ID html 0 <<< incorrect, should be RAW RAW RAW RAW
RAW > 0
EDIT2
Also tried the 2nd approach with Bart's grammar, still didn't work the 'html' is recognized as an ID, which should be 4 RAWs. When mmode=false, shouldn't RAW get matched first? Or the lexer still chooses the longest match here?
grammar freemarker_bart;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
FILE;
OUTPUT;
RAW_BLOCK;
}
@parser::members {
// merge a given list of tokens into a single AST
private CommonTree merge(List tokenList) {
StringBuilder b = new StringBuilder();
for(int i = 0; i < tokenList.size(); i++) {
Token token = (Token)tokenList.get(i);
b.append(token.getText());
}
return new CommonTree(new CommonToken(RAW, b.toString()));
}
}
@lexer::members {
private boolean mmode = false;
}
parse
: content* EOF -> ^(FILE content*)
;
content
: (options {greedy=true;}: t+=RAW)+ -> ^(RAW_BLOCK {merge($t)})
| if_stat
| output
;
if_stat
: TAG_START IF expression TAG_END raw_block TAG_END_START IF TAG_END -> ^(IF expression raw_block)
;
output
: OUTPUT_START expression OUTPUT_END -> ^(OUTPUT expression)
;
raw_block
: (t+=RAW)* -> ^(RAW_BLOCK {merge($t)})
;
expression
: eq_expression
;
eq_expression
: atom (EQUALS^ atom)*
;
atom
: STRING
| ID
;
// these tokens denote the start of markup code (sets mmode to true)
OUTPUT_START : '${' {mmode=true;};
TAG_START : '<#' {mmode=true;};
TAG_END_START : '</' ('#' {mmode=true;} | ~'#' {$type=RAW;});
RAW : {!mmode}?=> . ;
// these tokens denote the end of markup code (sets mmode to false)
OUTPUT_END : '}' {mmode=false;};
TAG_END : '>' {mmode=false;};
// valid tokens only when in "markup mode"
EQUALS : '==';
IF : 'if';
STRING : '"' ~'"'* '"';
ID : ('a'..'z' | 'A'..'Z')+;
SPACE : (' ' | '\t' | '\r' | '\n')+ {skip();};