How to use ANTLR v4 for syntax highlighting?

Asked 9/3, 2015 at 12:36 Answered 10/3, 2015 at 7:59

I've built a grammar for a DSL and I'd like to display some elements (table names) in some colors. I output HTML from Java.

columnIdentifier :
      columnName=Identifier
    | tableName=Identifier '.' columnName=Identifier
    ;

Identifier : Letter LetterOrDigit* ;
fragment Letter : [a-zA-Z_];
fragment LetterOrDigit : [a-zA-Z0-9_];

WS  :  [ \t\r\n\u000C]+ -> skip ;

I thought about using an AbstractParseTreeVisitor to return all elements as-is, except those I want to highlight which would be returned as <span class="tablename" >theoriginaltext</span>. Is it a good approach?

Note that whitespace is discarded before it reaches the parser, correct? So if I rebuild the output using an AbstractParseTreeVisitor, I can't restore the spaces.

I assume there's a canonical way of doing syntax highlighting with ANTLR4. It's difficult to find information about this because searches mostly return results about highlighting ANTLR4 grammar files in Eclipse/IDEA.

Decor answered 9/3, 2015 at 12:36 Comment(0)

The Definitive ANTLR 4 Reference contains the answer in section 12.1:

Instead of skipping whitespace, send it to a hidden channel:

WS  :  [ \t\r\n\u000C]+ -> channel(HIDDEN)
    ;

Then the whitespace is still ignored in the context of the grammar (which is what we want) and getTranslatedText() will successfully return all text including whitespace. Use a listener such as:

public static class HtmlHighlighterListener extends MyDSLBaseListener {
    private final CommonTokenStream tokens;
    private final TokenStreamRewriter rewriter;

    public HtmlHighlighterListener(CommonTokenStream tokens) {
        this.tokens = tokens;
        this.rewriter = new TokenStreamRewriter(tokens);
    }

    // ... Place here the overrides of "enterEveryRule" and "exitEveryRule"

    public String getTranslatedText() {
        return rewriter.getText();
    }
}

ParseTreeWalker walker = new ParseTreeWalker();
HtmlHighlighterListener listener = new HtmlHighlighterListener(tokens);
walker.walk(listener, tree);
return listener.getTranslatedText();

Then you can override "enterEveryRule" and "exitEveryRule" to add HTML tags for coloration.

Decor answered 9/3, 2015 at 16:3 Comment(0)

If you replace -> skip in the WS rule with -> channel(HIDDEN), the whitespace tokens are still generated and remain accessible later on.

If you are looking for an easy way to identify certain structures in your parse tree (not AST, which is currently not available in ANTLR4), you may have a look at parse tree pattern matching.

During matching, note the token indexes of the table names. Afterwards, walk through all tokens (including whitespace) and modify the ones that are table names.
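The last step can be sketched without any ANTLR dependency. In this minimal sketch the token texts are plain strings standing in for the full token stream (hidden-channel whitespace included), and tableNameIndexes is assumed to have been collected during matching. The class and method names are hypothetical, not ANTLR API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TableNameHighlighter {

    /**
     * Concatenates every token text in order (whitespace included) and wraps
     * the tokens whose index is in tableNameIndexes in a span.
     */
    static String render(List<String> tokenTexts, Set<Integer> tableNameIndexes) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokenTexts.size(); i++) {
            String text = escapeHtml(tokenTexts.get(i));
            if (tableNameIndexes.contains(i)) {
                out.append("<span class=\"tablename\">").append(text).append("</span>");
            } else {
                out.append(text);
            }
        }
        return out.toString();
    }

    /** Escapes the characters that would otherwise be read as HTML markup. */
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // Assumed input: the tokens of "orders.id total", where index 0 is a table name.
        List<String> tokens = Arrays.asList("orders", ".", "id", " ", "total");
        Set<Integer> tableNames = new HashSet<>(Arrays.asList(0));
        System.out.println(render(tokens, tableNames));
        // prints: <span class="tablename">orders</span>.id total
    }
}
```

With a real ANTLR token stream, the same loop would read each token's text and index from the stream instead of from a list of strings.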

Goldsworthy answered 10/3, 2015 at 7:59 Comment(0)

Why do you need a Visitor? Or even a parser? Wouldn't a pure lexer be sufficient for token classification?

When I was working on a code indentation tool I had to:

generate the AST
clone the AST into my own "cooked" tree where (nearly) every leaf held some lexer token
iterate over the lexer output again and append all the hidden tokens to their closest non-hidden tokens, that is, to those present in the AST
- imagine you have two ordered token streams: one is the lexer output, the other is a recursive walk over the AST. You join these two "streams" into a third one, producing an AST that also holds the hidden tokens
recursively walk the resulting tree, which let me reconstruct the whole input text, including comments and whitespace.

PS: this can work only if AST construction never switches the order of child nodes.
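The merge described above can be sketched roughly as follows. This is a simplified illustration, not the original tool: Tok and Leaf are hypothetical stand-ins for lexer tokens and tree leaves, each run of hidden tokens is attached to the next visible token (trailing hidden text goes to the last one), and the leaves are assumed to appear in the same order as the visible tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class HiddenTokenMerger {

    /** Hypothetical lexer token: its text and whether it is on a hidden channel. */
    static final class Tok {
        final String text;
        final boolean hidden;
        Tok(String text, boolean hidden) { this.text = text; this.hidden = hidden; }
    }

    /** Hypothetical tree leaf that also carries the hidden text around it. */
    static final class Leaf {
        final String leading;  // hidden text immediately before this leaf
        final String text;     // the visible token text
        String trailing = "";  // only set on the last leaf
        Leaf(String leading, String text) { this.leading = leading; this.text = text; }
    }

    /** Attaches every run of hidden tokens to the closest following visible token. */
    static List<Leaf> attachHidden(List<Tok> lexerTokens) {
        List<Leaf> leaves = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        for (Tok t : lexerTokens) {
            if (t.hidden) {
                pending.append(t.text);
            } else {
                leaves.add(new Leaf(pending.toString(), t.text));
                pending.setLength(0);
            }
        }
        // Hidden text after the last visible token has no successor to attach to.
        // (If there is no visible token at all, this sketch drops the hidden text.)
        if (pending.length() > 0 && !leaves.isEmpty()) {
            leaves.get(leaves.size() - 1).trailing = pending.toString();
        }
        return leaves;
    }

    /** Rebuilds the original input by walking the leaves in order. */
    static String reconstruct(List<Leaf> leaves) {
        StringBuilder out = new StringBuilder();
        for (Leaf leaf : leaves) {
            out.append(leaf.leading).append(leaf.text).append(leaf.trailing);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Tok> tokens = Arrays.asList(
                new Tok("orders", false), new Tok(" ", true), new Tok(".", false),
                new Tok(" ", true), new Tok("id", false), new Tok("\n", true));
        System.out.print(reconstruct(attachHidden(tokens)));
    }
}
```

As the PS notes, the round trip only works while the leaves stay in lexer order.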

Lenient answered 9/3, 2015 at 13:3 Comment(6)

ANTLR4 doesn't have ASTs anymore. I can't just use a lexer because it's not as simple as just coloring table names: I also have functions and I'd like to check their names and number of arguments before displaying them in red. https://mcmap.net/q/1009370/-build-ast-in-antlr4 – Decor 9/3, 2015 at 14:34

@Adrien: but ANTLR4 does have trees that it will build as you parse. For many purposes, these are just as good. – Elwin 9/3, 2015 at 14:51

@IraBaxter The tree you mention is the one that is browsed by AbstractParseTreeVisitor. It is one plausible method, but we need to find a way to keep the whitespace characters in the tree. – Decor 9/3, 2015 at 15:8

What I propose is to walk the tree "manually" (recursively) and, in parallel, iterate over the lexer token stream. That way you can "map" all the skipped hidden tokens to their corresponding tree leaves. – Lenient 9/3, 2015 at 15:11

@Adrien: I can't speak for what is in ANTLR trees, but if it is a decent parsing engine, it should capture source position (including line and column) information. The differences between the columns tell you where whitespace should be. – Elwin 9/3, 2015 at 15:21

@IraBaxter I know I don't have enough background about parser theory, but in this situation the question was really about ANTLR 4 and how to leverage it. Thank you very much for your attempt! – Decor 9/3, 2015 at 16:4
