Allow Whitespace sections ANTLR4

I have an ANTLR4 grammar for a domain-specific language that is embedded in a text template.

There are two modes:

  • Text (whitespace should be preserved)
  • Code (whitespace should be ignored)

Sample grammar part:

template
  :  '{' templateBody '}'
  ;

templateBody
  : templateChunk*
  ;

templateChunk
  : code # codeChunk // dsl code, ignore whitespace
  | text # textChunk // any text, preserve whitespace
  ;

The rule for code may contain a nested reference to the template rule, so the parser must support nested whitespace-preserving and whitespace-ignoring sections.

Maybe lexer modes can help - with some drawbacks:

  • the code sections must be parsed in another compiler pass
  • I doubt that nested sections could be mapped correctly

Yet the most promising approach seems to be the manipulation of hidden channels.

My question: Is there a best practice for meeting these requirements? Is there an example grammar that has already solved similar problems?

Appendix:

The rest of the grammar could look as follows:

code
  : '@' function
  ;

function
  : Identifier '(' argument ')'
  ;

argument
  : function
  | template
  ;

text
  : Whitespace+
  | Identifier
  | .+
  ;

Identifier
  : LETTER (LETTER|DIGIT)*
  ;

Whitespace
  : [ \t\n\r] -> channel(HIDDEN)
  ;

fragment LETTER
    : [a-zA-Z]
    ;

fragment DIGIT
    : [0-9]
    ;

In this example, code has only a dummy implementation, which shows that it can contain nested code/template sections. In practice, code should support

  • multiple arguments
  • arguments of primitive types (ints, strings, ...)
  • maps and lists
  • function evaluation
  • ...
Glib answered 15/3, 2015 at 12:17 Comment(2)
You can push/pop lexer modes, so you should be fine with them. But you'd have to post the code and text rules so we can see whether you really need a second pass. - Toshiatoshiko
This probably works fine if the delimiters of text are context-insensitive (i.e. if any occurrence of these delimiters opens/closes a text section). It gets difficult if it depends on the parser state whether the delimiters delimit a text section or another language structure. - Glib

This is how I solved the problem in the end:

The idea is to enable/disable whitespace in a parser rule:

 templateBody : {enableWs();} templateChunk* {disableWs();};

We then have to define enableWs and disableWs in our parser base class:

public void enableWs() {
    if (_input instanceof MultiChannelTokenStream) {
        ((MultiChannelTokenStream) _input).enable(HIDDEN);
    }
}

public void disableWs() {
    if (_input instanceof MultiChannelTokenStream) {
        ((MultiChannelTokenStream) _input).disable(HIDDEN);
    }
}

Now what is this MultiChannelTokenStream?

  • ANTLR4 defines a CommonTokenStream, a token stream that reads from only one channel.
  • MultiChannelTokenStream is a token stream that reads from all enabled channels. To implement it, I took the source code of CommonTokenStream and replaced each reference to the single channel with a set of channels (each equality comparison becomes a contains check).
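
The "equality becomes contains" idea can be illustrated with a small standalone model. This is only a sketch and not the actual MultiChannelTokenStream: SimpleToken and MultiChannelModel are hypothetical names, the channel numbers are assumptions, and nothing here uses the ANTLR runtime. A real implementation derived from CommonTokenStream would also have to apply the same filter to lookahead.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for an ANTLR token: just text plus a channel number. */
final class SimpleToken {
    final String text;
    final int channel;

    SimpleToken(String text, int channel) {
        this.text = text;
        this.channel = channel;
    }
}

/** Hypothetical model of a token stream that reads from a set of enabled channels. */
final class MultiChannelModel {
    static final int DEFAULT = 0; // assumed channel number for ordinary tokens
    static final int HIDDEN = 1;  // assumed channel number for whitespace

    private final List<SimpleToken> tokens;
    private final Set<Integer> channels = new HashSet<>();
    private int index = 0;

    MultiChannelModel(List<SimpleToken> tokens) {
        this.tokens = tokens;
        channels.add(DEFAULT); // only the default channel is visible at first
    }

    void enable(int channel) {
        channels.add(channel);
    }

    void disable(int channel) {
        channels.remove(Integer.valueOf(channel));
    }

    /** Returns the next token on an enabled channel, or null at end of input. */
    SimpleToken next() {
        while (index < tokens.size()) {
            SimpleToken t = tokens.get(index++);
            // A single-channel stream would test t.channel == channel here;
            // the multi-channel variant tests membership in the enabled set.
            if (channels.contains(t.channel)) {
                return t;
            }
        }
        return null;
    }
}
```

In this model, calling enable(MultiChannelModel.HIDDEN) makes subsequent whitespace tokens visible to the consumer, and disable(MultiChannelModel.HIDDEN) hides them again, which mirrors what enableWs and disableWs do for the parser.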

An example implementation using the grammar above can be found at antlr4multichannel.

Glib answered 18/3, 2015 at 6:28 Comment(1)
Nice solution. Here's a different approach if you always need the full text, but your solution allows for more scenarios (for instance, you could enable whitespace but disable comments). - Toshiatoshiko
