ANTLR4 grammar token recognition error after import

I am using a parser grammar and a lexer grammar for ANTLR4 from GitHub to parse PHP in Python 3.

When I use these grammars directly my PoC code works:

antlr-test.py

from antlr4 import *
# from PHPParentLexer import PHPParentLexer
# from PHPParentParser import PHPParentParser
# from PHPParentParser import PHPParentListener

from PHPLexer import PHPLexer as PHPParentLexer
from PHPParser import PHPParser as PHPParentParser
from PHPParser import PHPParserListener as PHPParentListener


class PhpGrammarListener(PHPParentListener):
    def enterFunctionInvocation(self, ctx):
        print("enterFunctionInvocation " + ctx.getText())


if __name__ == "__main__":
    scanner_input = FileStream('test.php')
    lexer = PHPParentLexer(scanner_input)
    stream = CommonTokenStream(lexer)
    parser = PHPParentParser(stream)
    tree = parser.htmlDocument()
    walker = ParseTreeWalker()
    printer = PhpGrammarListener()
    walker.walk(printer, tree)

which gives the output

/opt/local/bin/python3.4 /Users/d/PycharmProjects/name/antlr-test.py
enterFunctionInvocation echo("hi") 
enterFunctionInvocation another_method("String")
enterFunctionInvocation print("print statement")

Process finished with exit code 0

When I use the following PHPParent.g4 grammar, I get a lot of errors:

grammar PHPParent;
options { tokenVocab=PHPLexer; }
import PHPParser;

After swapping the comments on the Python imports, I get this error:

/opt/local/bin/python3.4 /Users/d/PycharmProjects/name/antlr-test.py
line 1:1 token recognition error at: '?'
line 1:2 token recognition error at: 'p'
line 1:3 token recognition error at: 'h'
line 1:4 token recognition error at: 'p'
line 1:5 token recognition error at: '\n'
...
line 2:8 no viable alternative at input '<('
line 2:14 mismatched input ';' expecting {<EOF>, '<', '{', '}', ')', '?>', 'list', 'global', 'continue', 'return', 'class', 'do', 'switch', 'function', 'break', 'if', 'for', 'foreach', 'while', 'new', 'clone', '&', '!', '-', '~', '@', '$', <INVALID>, 'Interface', 'abstract', 'static', Array, RequireOperator, DecimalNumber, HexNumber, OctalNumber, Float, Boolean, SingleQuotedString, DoubleQuotedString_Start, Identifier, IncrementOperator}
line 3:28 mismatched input ';' expecting {<EOF>, '<', '{', '}', ')', '?>', 'list', 'global', 'continue', 'return', 'class', 'do', 'switch', 'function', 'break', 'if', 'for', 'foreach', 'while', 'new', 'clone', '&', '!', '-', '~', '@', '$', <INVALID>, 'Interface', 'abstract', 'static', Array, RequireOperator, DecimalNumber, HexNumber, OctalNumber, Float, Boolean, SingleQuotedString, DoubleQuotedString_Start, Identifier, IncrementOperator}
line 4:28 mismatched input ';' expecting {<EOF>, '<', '{', '}', ')', '?>', 'list', 'global', 'continue', 'return', 'class', 'do', 'switch', 'function', 'break', 'if', 'for', 'foreach', 'while', 'new', 'clone', '&', '!', '-', '~', '@', '$', <INVALID>, 'Interface', 'abstract', 'static', Array, RequireOperator, DecimalNumber, HexNumber, OctalNumber, Float, Boolean, SingleQuotedString, DoubleQuotedString_Start, Identifier, IncrementOperator}

However I get no errors when running the antlr4 tool over the grammars. I'm stumped here - what could be causing this issue?

$ a4p PHPLexer.g4
warning(146): PHPLexer.g4:363:0: non-fragment lexer rule DoubleQuotedStringBody can match the empty string
$ a4p PHPParser.g4
warning(154): PHPParser.g4:523:0: rule doubleQuotedString contains an optional block with at least one alternative that can match an empty string
$ a4p PHPParent.g4
warning(154): PHPParent.g4:523:0: rule doubleQuotedString contains an optional block with at least one alternative that can match an empty string
Eskill answered 14/4, 2015 at 14:28 Comment(7)
Does your grammar PHPParent consist of only three lines? If not, please post the complete grammar. - Eudoca
It does - I wanted to test importing grammars in isolation. - Eskill
So far I can only say that Java does not allow a grammar without rules. I added a pseudo rule, myfile : file;, and it compiled (calling myfile instead of file on the parser object). I did not test the parser, though, because I do not have a Python environment. Have you tried it with such a delegator rule? - Eudoca
Thanks - I hope to give this a try tomorrow. I'll let you know whether it works; if it does, make this an answer and I'll accept it. - Eskill
Unfortunately this is still erroring out - "mismatched input '<?php echo('hi') ?>' expecting <INVALID>" - Eskill
Try importing the lexer. The error message "line 1:1 token recognition error at:" comes from the lexer and means that there is no matching lexer rule. From your post I cannot see what PHPParentLexer contains; perhaps the problem lies there. - Eudoca
This old post on the ANTLR mailing list may be of use? It mentions that to use the tokenVocab option, you need to have a tokens file in the same directory as the grammar. - Speroni

Import in ANTLR4 is kind of messy.

First, tokenVocab does not generate the lexer you need. It only means that this grammar uses the token types of PHPLexer, which are read from PHPLexer.tokens. If you delete PHPLexer.tokens, the grammar won't even compile!

Take a look at PHPParser.g4, which also uses options { tokenVocab=PHPLexer; }. Even so, the Python script still has to use the lexer from PHPLexer to make it work. The generated PHPParentLexer is not usable at all, and that's why you got all those errors.
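One way to see this for yourself is to compare the generated .tokens files. The following is a minimal sketch, assuming the usual .tokens layout of one NAME=number or 'literal'=number entry per line. The file contents are made up for illustration and are not taken from the real PHP grammars, and parse_tokens_file and vocab_mismatches are hypothetical helper names.

```python
def parse_tokens_file(text):
    """Map each token name or quoted literal to its numeric token type."""
    vocab = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the LAST '=' because a literal such as '=' contains one itself.
        name, _, number = line.rpartition("=")
        vocab[name] = int(number)
    return vocab


def vocab_mismatches(lexer_vocab, other_vocab):
    """Return entries of other_vocab that the lexer vocabulary lacks or numbers differently."""
    return {
        name: (lexer_vocab.get(name), number)
        for name, number in other_vocab.items()
        if lexer_vocab.get(name) != number
    }


# Hypothetical file contents, for illustration only.
LEXER_TOKENS = """\
PHPStart=1
Echo=2
Identifier=3
'echo'=2
'='=4
"""

PARENT_TOKENS = """\
T__0=1
Identifier=3
"""

if __name__ == "__main__":
    lexer_vocab = parse_tokens_file(LEXER_TOKENS)
    parent_vocab = parse_tokens_file(PARENT_TOKENS)
    # Each value is (type in the lexer vocabulary, type in the other vocabulary).
    print(vocab_mismatches(lexer_vocab, parent_vocab))
```

To try it on real output, read PHPLexer.tokens and PHPParent.tokens from disk and pass their contents to parse_tokens_file. Any entry that shows up in the result is a token type on which the two vocabularies disagree.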

To generate a new lexer out of combined grammar, you need to import it like this:

grammar PHPParent;
import PHPLexer;

However, a combined grammar cannot import a lexer grammar that uses modes, and PHPLexer itself uses modes a lot. So that's not an option either.
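If you want to check quickly whether a lexer grammar declares modes before trying to import it, a rough text scan is enough. This is only a sketch: declared_modes is a hypothetical helper, the sample grammar is invented for illustration, and the regular expression does not understand grammar comments or string literals, so treat its result as a hint rather than a real parse.

```python
import re

# Matches a mode declaration at the start of a line, e.g. "mode PHP;".
# Lexer commands such as "-> mode(X)" or "-> pushMode(X)" are not declarations
# and are deliberately not matched.
MODE_DECL = re.compile(r"^\s*mode\s+([A-Za-z_]\w*)\s*;", re.MULTILINE)


def declared_modes(grammar_text):
    """Return the names of all modes declared in the given grammar text."""
    return MODE_DECL.findall(grammar_text)


# Invented sample grammar, not the real PHPLexer.g4.
SAMPLE_LEXER = """\
lexer grammar Demo;
Open  : '<?php' -> pushMode(PHP);
Text  : ~'<'+ ;
mode PHP;
Close : '?>' -> popMode;
Name  : [a-z]+ ;
"""

if __name__ == "__main__":
    print(declared_modes(SAMPLE_LEXER))
```

A non-empty result means the grammar relies on modes and is not a candidate for import into a combined grammar.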

Can we simply replace PHPParentLexer with PHPLexer? Sadly, no. PHPParentParser is generated together with PHPParentLexer, so the two are tightly coupled and cannot be used separately. If you use PHPLexer, PHPParentParser won't work properly either. For this particular grammar it does produce a result, thanks to error recovery, but it still reports some errors.

There seems to be no better way than to rewrite some of the grammar. There are definitely some design issues in the import feature of ANTLR4.

Unfavorable answered 30/4, 2015 at 4:39 Comment(0)
