Matching arbitrary text (both symbols and spaces) with ANTLR?

Asked 11/5, 2013 at 11:24 Answered 13/5, 2013 at 14:3

How can I match arbitrary text in ANTLR v4, that is, text that is unknown at the time the grammar is written?

My grammar is as follows:

grammar Anytext;

line :
    comment;

comment : '#' anytext;

anytext: ANY*;

WS : [ \t\r\n]+;

ANY : .;

And my code is as follows:

    String line = "# This_is_a_comment";
    ANTLRInputStream input = new ANTLRInputStream(line);
    AnytextLexer lexer = new AnytextLexer(input);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    AnytextParser parser = new AnytextParser(tokens);
    ParseTree tree = parser.comment();
    System.out.println(tree.toStringTree(parser)); // print LISP-style tree

The output is as follows:

line 1:1 extraneous input ' ' expecting {<EOF>, ANY}
(comment # (anytext   T h i s _ i s _ a _ c o m m e n t))

If I change the ANY rule to

ANY : [ \t\r\n.];

it stops recognizing any symbol at all.

UPDATE1

My input has no end-of-line character at the end.

UPDATE 2

So, as I understand it, it is impossible to match arbitrary text in the lexer, since the lexer can't assign a character to multiple classes. If I define a lexer rule for any symbol, it will either hide all the other rules or not work at all.

But the question persists.

How can I match all symbols at the parser level, then?

Suppose I have table-shaped data and I want to process some fields and ignore others. If I had an anytext rule, I would write

infoline :
    ( codepoint WS 'field1' WS field1Value ) |
    ( codepoint WS 'field2' WS field2Value ) |
    ( codepoint WS anytext );

Here I parse a row if its second column contains field1 or field2, and ignore the row otherwise.

How can I accomplish this?

Histoplasmosis answered 11/5, 2013 at 11:24 Comment(0)

It's important to remember that ANTLR will break up your complete input into tokens before the parser ever sees the first token (at least it behaves this way). Your lexer grammar looks like the following.

T__0 : '#'; // implicit token created due to the use of '#' in parser rule comment

WS : [ \t\r\n]+;

ANY : .;

For your input, the tokens are the following:

# (type T__0)
[space] (type WS)
T (type ANY)
h (type ANY)
i (type ANY)
s (type ANY)
_ (type ANY)
i (type ANY)
s (type ANY)
_ (type ANY)
a (type ANY)
_ (type ANY)
c (type ANY)
o (type ANY)
m (type ANY)
m (type ANY)
e (type ANY)
n (type ANY)
t (type ANY)

Your current grammar fails to parse because the WS token isn't allowed in the comment rule. It would parse this input (but may run into problems as you expand your grammar) if you used this:

// remember that '#' is its own token
anytext: (ANY | WS | '#')*;
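As a rough illustration of the tokenization described above, here is a stand-alone Java sketch that imitates the three lexer rules with java.util.regex instead of the ANTLR runtime. The class name LexerSketch and the tokenize helper are hypothetical, and the regular expressions only approximate the grammar's rules. The selection policy it applies (the longest match wins, and on a tie the rule listed first wins) is the one the ANTLR lexer uses to pick a token type.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: imitates the lexer rules T__0, WS and ANY with
// regular expressions. It does not use the ANTLR runtime.
class LexerSketch {

    // Rule names and patterns, in the same order as in the lexer grammar.
    static final String[] NAMES = {"T__0", "WS", "ANY"};
    static final Pattern[] RULES = {
        Pattern.compile("#"),                 // T__0 : '#';
        Pattern.compile("[ \\t\\r\\n]+"),     // WS   : [ \t\r\n]+;
        Pattern.compile(".", Pattern.DOTALL)  // ANY  : .;
    };

    // Splits the input into tokens, returned as "TYPE:text" strings.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int bestLen = 0;
            String bestName = null;
            for (int i = 0; i < RULES.length; i++) {
                Matcher m = RULES[i].matcher(input);
                m.region(pos, input.length());
                // Strictly longer matches replace the current best, so on a
                // tie the rule that appears earlier in the grammar is kept.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestName = NAMES[i];
                }
            }
            tokens.add(bestName + ":" + input.substring(pos, pos + bestLen));
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("# This_is_a_comment")) {
            System.out.println(token);
        }
    }
}
```

Running main prints one TYPE:text pair per token: T__0 for the '#', WS for the space, and one ANY token for each remaining character. The space never gets the type ANY, which is why a parser rule that accepts only ANY rejects it.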

What you could do is change comment to be a lexer rule, which consumes the # character along with whatever follows (in this case, to the end of the line):

grammar Anytext;

line : COMMENT;

COMMENT : '#' ~[\r\n]*;

WS : [ \t\r\n]+;

ANY : .;

Michellemichels answered 13/5, 2013 at 14:3 Comment(4)

I don't understand, why you wrote [space] (type WS). From my point of view it is also ANY? Why not? – Histoplasmosis 13/5, 2013 at 17:16

@SuzanCioc ANTLR never assigns more than one type to a token. The space character matches both the WS and ANY rules. To resolve the ambiguity, since WS appears before ANY in the grammar, the token is assigned the WS type. The ambiguity is resolved and the token type assigned before the parser sees the token, so the parser will never see a space character token with the type ANY. – Michellemichels 13/5, 2013 at 17:41

What about trees? Are they also prohibited in the lexer? What if I write WS : [ \t\r\n]; ANY : WS | .;? Will a space be marked with both ANY and WS? – Histoplasmosis 13/5, 2013 at 18:2

If this is true, then this is the answer: the lexer does not allow ambiguity or trees. – Histoplasmosis 13/5, 2013 at 18:4

Use the following rule for line comments:

LINE_COMMENT
    :   '#' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
    ;

It matches '#' and every character up to and including the end of the line (Unix or Windows line breaks).

Edit by 280Z28: here is the exact same rule in ANTLR 4 syntax:

LINE_COMMENT
    :   '#' ~[\r\n]* '\r'? '\n' -> channel(HIDDEN)
    ;

Ombre answered 11/5, 2013 at 14:3 Comment(7)

I edited your post to give exactly the same rule in ANTLR 4 syntax. On a separate note, I recommend not including the '\r'? '\n' line terminator as part of the LINE_COMMENT rule itself (make it consume characters up to, but not including the end of line). There are a few reasons I recommend this, but the biggest is the fact that in the current form LINE_COMMENT will not match a comment on the last line of a file if it's not followed by an explicit line terminator. – Michellemichels 11/5, 2013 at 14:54
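To illustrate the point about the last line of a file, here is a small Java sketch that uses java.util.regex as a stand-in for the two variants of the lexer rule. The class name LineCommentSketch, the matchesWhole helper, and the regex translations are assumptions made for illustration; this is not ANTLR-generated code.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: regular expressions standing in for two lexer rules.
class LineCommentSketch {

    // Stand-in for: '#' ~[\r\n]* '\r'? '\n'
    static final Pattern WITH_TERMINATOR = Pattern.compile("#[^\\r\\n]*\\r?\\n");

    // Stand-in for: '#' ~[\r\n]*
    static final Pattern WITHOUT_TERMINATOR = Pattern.compile("#[^\\r\\n]*");

    // True if the rule consumes the entire input as one comment.
    static boolean matchesWhole(Pattern rule, String input) {
        return rule.matcher(input).matches();
    }

    public static void main(String[] args) {
        String lastLine = "# This_is_a_comment"; // no trailing line break

        System.out.println(matchesWhole(WITH_TERMINATOR, lastLine));          // false
        System.out.println(matchesWhole(WITHOUT_TERMINATOR, lastLine));       // true
        System.out.println(matchesWhole(WITH_TERMINATOR, lastLine + "\r\n")); // true
    }
}
```

The variant that requires the terminator only matches once a line break is appended, which is the situation the string in the question does not provide.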

Why is it so complex? Is it possible to write it more simply? Why does my rule not work? – Histoplasmosis 11/5, 2013 at 19:9

@280Z28 can you provide an answer done your way, not including the end-of-line characters? – Histoplasmosis 11/5, 2013 at 19:14

When you use a .* rule, it "eats" line breaks and thus matches everything to the end of the stream. Use the following if you do not want to include the end-of-line characters: LINE_COMMENT: '#' ~[\r\n]*; – Ombre 11/5, 2013 at 22:24

@Ombre I have no line break characters at the end, see the code. I am parsing a string variable. – Histoplasmosis 12/5, 2013 at 21:52

@Ombre do you mean it is impossible to match any symbol except with a negated class? What is wrong with [ \t\r\n.]? Will just . match spaces? – Histoplasmosis 13/5, 2013 at 7:44

You do not need to mix \t\r\n and . because . matches everything anyway. If you want everything after the pound sign, use this: LINE_COMMENT: '#' .*; – Ombre 13/5, 2013 at 11:22
