ANTLR4 Python parsing big files
I am trying to write a parser for Juniper SRX router access control lists. Below is the grammar I am using:

grammar SRXBackend;

acl:
    'security' '{' 'policies' '{' COMMENT* replaceStmt '{' policy* '}' '}' '}'
            applications
            addressBook
;

replaceStmt:
    'replace:' IDENT
|   'replace:' 'from-zone' IDENT 'to-zone' IDENT
;

policy:
    'policy' IDENT '{' 'match' '{' fromStmt* '}' 'then' (action | '{' action+ '}') '}'
;

fromStmt:
     'source-address' addrBlock                     # sourceAddrStmt
|    'destination-address' addrBlock                # destinationAddrStmt
|    'application' (srxName ';' | '[' srxName+ ']')  # applicationBlock
;

action:
    'permit' ';'
|   'deny' ';'
|   'log { session-close; }'
;

addrBlock:
    '[' srxName+ ']'
|   srxName ';'
;

applications:
    'applications' '{' application* '}'
|   'applications' '{' 'apply-groups' IDENT ';' '}' 'groups' '{' replaceStmt  '{' 'applications' '{' application* '}' '}' '}'
;

addressBook:
    'security' '{' 'address-book' '{' replaceStmt '{' addrEntry* '}' '}' '}'
|   'groups' '{' replaceStmt  '{' 'security' '{' 'address-book' '{' IDENT '{' addrEntry* '}' '}' '}' '}' '}' 'security' '{' 'apply-groups' IDENT ';' '}'
;

application:
    'replace:'? 'application' srxName '{' applicationStmt+ '}'
;

applicationStmt:
    'protocol' srxName ';'            #applicationProtocol
|   'source-port' portRange ';'       #applicationSrcPort
|   'destination-port' portRange ';'  #applicationDstPort
;

portRange:
    NUMBER             #portRangeOne
|   NUMBER '-' NUMBER  #portRangeMinMax
;

addrEntry:
    'address-set' IDENT '{' addrEntryStmt+ '}' #addrEntrySet
|   'address' srxName cidr ';'                 #addrEntrySingle
;

addrEntryStmt:
    ('address-set' | 'address') srxName ';'
;

cidr:
    NUMBER '.' NUMBER '.' NUMBER '.' NUMBER ('/' NUMBER)?
;

srxName:
    NUMBER
|   IDENT
|   cidr
;

COMMENT : '/*' .*? '*/' ;
NUMBER  : [0-9]+ ;
IDENT   : [a-zA-Z][a-zA-Z0-9,\-_:\./]* ;
WS      : [ \t\n]+ -> skip ;

When I try to use an ACL with ~80,000 lines, it takes up to ~10 minutes to generate the parse tree. I am using the following code to create the parse tree:

from antlr4 import *
from SRXBackendLexer import SRXBackendLexer
from SRXBackendParser import SRXBackendParser
import sys


def main(argv):
    ipt = FileStream(argv[1])
    lexer = SRXBackendLexer(ipt)
    stream = CommonTokenStream(lexer)
    parser = SRXBackendParser(stream)
    parser.acl()

if __name__ == '__main__':
    main(sys.argv)
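For reference, a per-function profile like the one further below can be collected with the standard cProfile/pstats modules, roughly like this (a minimal sketch; do_parse() here is just a stand-in for the real parser.acl() call):

```python
import cProfile
import pstats

def do_parse():
    # Stand-in for the real work (parser.acl() on the ACL input)
    sum(i * i for i in range(100000))

profiler = cProfile.Profile()
profiler.enable()
do_parse()
profiler.disable()

# Sort by total time spent in each function, as in the listing below
stats = pstats.Stats(profiler).sort_stats('tottime')
stats.print_stats(10)
```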

I am using Python 2.7 as the target language. I also ran cProfile to identify which code takes the most time. Below are the first few records, sorted by time:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   608448   62.699    0.000  272.359    0.000 LexerATNSimulator.py:152(execATN)
  5007036   41.253    0.000   71.458    0.000 LexerATNSimulator.py:570(consume)
  5615722   32.048    0.000   70.416    0.000 DFAState.py:131(__eq__)
 11230968   24.709    0.000   24.709    0.000 InputStream.py:73(LA)
  5006814   21.881    0.000   31.058    0.000 LexerATNSimulator.py:486(captureSimState)
  5007274   20.497    0.000   29.349    0.000 ATNConfigSet.py:160(__eq__)
 10191162   18.313    0.000   18.313    0.000 {isinstance}
 10019610   16.588    0.000   16.588    0.000 {ord}
  5615484   13.331    0.000   13.331    0.000 LexerATNSimulator.py:221(getExistingTargetState)
  6832160   12.651    0.000   12.651    0.000 InputStream.py:52(index)
  5007036   10.593    0.000   10.593    0.000 InputStream.py:67(consume)
   449433    9.442    0.000  319.463    0.001 Lexer.py:125(nextToken)
        1    8.834    8.834   16.930   16.930 InputStream.py:47(_loadString)
   608448    8.220    0.000  285.163    0.000 LexerATNSimulator.py:108(match)
  1510237    6.841    0.000   10.895    0.000 CommonTokenStream.py:84(LT)
   449432    6.044    0.000  363.766    0.001 Parser.py:344(consume)
   449433    5.801    0.000    9.933    0.000 Token.py:105(__init__)

I cannot make much sense of it, except that InputStream.LA alone takes around half a minute. I guess this is because the entire input string gets buffered/loaded at once. Is there any alternative or lazier way of parsing or loading data for the Python target? Is there any improvement I can make to the grammar to speed up parsing?
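One technique mentioned in the comments is ANTLR's two-stage parsing: try the fast SLL prediction mode first and fall back to full LL only if it fails. A rough, unbenchmarked sketch of what that could look like in the Python runtime follows; it relies on the runtime's internal `_interp` attribute and assumes the generated `SRXBackendLexer`/`SRXBackendParser` modules are importable:

```python
def parse_acl(path):
    """Two-stage parse: fast SLL prediction first, full LL as a fallback."""
    from antlr4 import CommonTokenStream, FileStream
    from antlr4.atn.PredictionMode import PredictionMode
    from antlr4.error.ErrorStrategy import BailErrorStrategy, DefaultErrorStrategy
    from antlr4.error.Errors import ParseCancellationException
    from SRXBackendLexer import SRXBackendLexer
    from SRXBackendParser import SRXBackendParser

    tokens = CommonTokenStream(SRXBackendLexer(FileStream(path)))
    parser = SRXBackendParser(tokens)
    parser._interp.predictionMode = PredictionMode.SLL  # stage 1: SLL only
    parser._errHandler = BailErrorStrategy()            # abort on first error
    try:
        return parser.acl()
    except ParseCancellationException:
        # Stage 2: rewind and reparse with the (slower) full LL algorithm
        tokens.seek(0)
        parser.reset()
        parser._errHandler = DefaultErrorStrategy()
        parser._interp.predictionMode = PredictionMode.LL
        return parser.acl()
```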

Thank you

Putrescine answered 10/3, 2016 at 23:30 Comment(13)
It is not an answer, but have you tried using PyPy or anything else? Just to know how much of the burden falls on Python?Shriek
I haven't used PyPy but did some more research since yesterday. It seems that ANTLR's input stream class converts the entire text input into a buffer of character codes, character by character. That alone seems to take over a minute. Is there a faster way to do this? I am sure I can override the input stream's implementation as long as I can find a better way of doing this.Putrescine
@prthrokz, I'd advise you to try the "old" ANTLR 3. ANTLR 4 tries to parse just about every grammar but has to put in excessive runtime effort to parse even very simple grammars. ANTLR 3 is more restrictive, but fast.Fishing
@Kay: I read somewhere that we can fall back to ANTLR3-style SLL(*) parsing by setting a flag on the parser. Any idea if the Python target supports that feature?Putrescine
@prthrokz, sorry, I have no idea.Fishing
@Putrescine Are you using 4.5.3 runtime? If not, try switching to it and see if it makes any difference. I read somewhere that previous release(s) had some bug resulting in slow parsing, and this was fixed in 4.5.3 released on March 30th 2016 on PyPi - pypi.python.org/pypi/antlr4-python2-runtimeSalamis
I had a big performance problem with an Ada parser using AntLR. Running it with pypy divided running times by 10 or more. I recommend that approach if you cannot fix it by another means.Issiah
Could you also post the Lexer definition?Karmenkarna
Not a plug... although I'm the main author. If the target language is Python, you could consider using Grako. The example in examples/antlr2grako will translate the ANTLR grammar to Grako (e-ebnf) format (better if you include the lexical definitions in the same .g file).Karmenkarna
@Apalala, the tokens are at the bottom of the grammar.Melodize
Please try this change: portRange: NUMBER ('-' NUMBER)?;Karmenkarna
It is my understanding that your IDENT can be zero-sized due to * instead of +. This would send your parser into all sorts of funny loops for every single character...Levitation
Is your question still relevant? If so, do you have a sample ACL file?Penitential
It is my understanding that your IDENT can be zero-sized due to * instead of +. This sends your parser into loops for every single character, generating zero-sized IDENT nodes.

Levitation answered 22/7, 2017 at 21:44 Comment(1)
His rule is IDENT : [a-zA-Z][a-zA-Z0-9,\-_:\./]*. It can't be 0-sized. It has to have a letter (either case) followed by 0 or more characters from the second set. (It's easy to see the trailing *, but miss that it's two separate sets.)Counterproof
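For illustration, the same two character classes rendered as a Python regex (the ANTLR lexer does not use regexes, but the classes translate directly):

```python
import re

# IDENT's two character classes, translated to a Python regex for illustration
ident = re.compile(r'[a-zA-Z][a-zA-Z0-9,\-_:./]*')

print(bool(ident.fullmatch("trust-zone")))   # True: leading letter, rest in the set
print(bool(ident.fullmatch("")))             # False: cannot be zero-sized
print(bool(ident.fullmatch("10.0.0.1")))     # False: starts with a digit
```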

Try this:

# Install ANTLR and the Python runtime for ANTLR
!pip install antlr4-python3-runtime

# Generate a parser from the grammar (the Python 3 target, matching the runtime above)
!antlr4 -Dlanguage=Python3 SRXBackend.g4

# Import the ANTLR runtime and the generated parser and lexer
from antlr4 import InputStream, CommonTokenStream
from SRXBackendParser import SRXBackendParser
from SRXBackendLexer import SRXBackendLexer

# Read the input file
with open("input.txt", "r") as file:
    input_str = file.read()

# Create a lexer and parser
lexer = SRXBackendLexer(InputStream(input_str))
stream = CommonTokenStream(lexer)
parser = SRXBackendParser(stream)

# Parse the input
tree = parser.acl()
Dysphasia answered 5/1, 2023 at 17:58 Comment(2)
Make sure you replace input.txt with your own input fileDysphasia
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.Audwin

© 2022 - 2024 — McMap. All rights reserved.