Is it possible to parse a big file with ANTLR?

Is it possible to instruct ANTLR not to load the entire file into memory? Can it apply rules one by one and generate the topmost list of nodes sequentially while reading the file? Also, is it perhaps possible to drop already-analyzed nodes somehow?

Farmstead asked 8/5, 2013 at 4:08 Comment(10)
Why do you need this? Arguably you can build a huge tree in memory, given sufficient memory. – Hispaniola
Obviously my file is bigger than memory. – Farmstead
Yes? And just how big is your file? And if it is truly huge, how much value is there in processing it? If there's not enough value, there's little point in solving the problem. If there is sufficient value, big resources (time, space, custom tools) should be on the table as an option. – Hispaniola
My file is 100 gigabytes. I need to scan it and put the results into a database. Anyway, you are taking the discussion off topic. Please answer if you know the answer. – Farmstead
I don't know specifically how to do this with ANTLR. In general, you could treat a single stream opened on the file as a really large sequence of small files, by having an ANTLR parser that parses one record, effectively declares it has found EOF, and then passes the remainder of the stream to a new ANTLR instance. In fact, there's nothing special about this idea that requires ANTLR; you can do this with essentially any parsing engine. What is it about your file that requires parsing? – Hispaniola
Your scheme means that something else, not ANTLR, should split the large file into parts. This is an acceptable approach, but since splitting also requires parsing, which is ANTLR's job, it is reasonable to think about how ANTLR could do it. Of course I can write all the parsing myself, and then ANTLR is not needed. – Farmstead
My scheme suggests using whatever parsing machinery you need to recognize the record boundaries, which breaks up the file without physically partitioning it. If you don't need ANTLR's power, you can write your own. Usually humongous files containing records aren't very complex in record structure, and an ad hoc scheme does pretty well. – Hispaniola
How do you know what huge files usually are if you didn't even believe they exist a few posts ago??? – Farmstead
"Usually" are? I didn't say I didn't believe you had big files. I asked how big it was. (You can get machines with 100Gb RAM. You don't even need that much to process 100Gb files; we have run processes that used 90Gb in machines that only had 16Gb of RAM).Hispaniola
@Ira Baxter, this is ideally what I wish to do: I want ANTLR to parse each record, then return EOF, essentially generating a new instance of the parser for each record. There might be some undesirable overhead in creating/destroying the ANTLR instance each time, though... it would be nice if I could just have a *.g4 grammar token effectively trigger a "reset of ANTLR4" when the grammar rule is encountered. – Closeknit

Yes, you can use:

  • UnbufferedCharStream for your character stream (passed to lexer)
  • UnbufferedTokenStream for your token stream (passed to parser)
    • This token stream implementation doesn't differentiate between token channels, so make sure to use ->skip instead of ->channel(HIDDEN) as the command in lexer rules whose tokens shouldn't be sent to the parser.
  • Make sure to call setBuildParseTree(false) on your parser or a giant parse tree will be created for the entire file (a minimal setup sketch follows just after this list).
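
A minimal sketch of that setup for the Java runtime, hedged: MyLexer, MyParser, and the file start rule are placeholders for whatever ANTLR generates from your grammar, not names from the answer itself.

```java
import org.antlr.v4.runtime.*;
import java.io.FileInputStream;

public class BigFileParse {
    public static void main(String[] args) throws Exception {
        // Stream characters from disk instead of loading the whole file.
        CharStream chars = new UnbufferedCharStream(new FileInputStream(args[0]));

        MyLexer lexer = new MyLexer(chars);                   // placeholder generated lexer
        // The unbuffered char stream releases characters as it goes, so tokens
        // should copy their text eagerly or getText() may fail later.
        lexer.setTokenFactory(new CommonTokenFactory(true));

        TokenStream tokens = new UnbufferedTokenStream(lexer);

        MyParser parser = new MyParser(tokens);               // placeholder generated parser
        parser.setBuildParseTree(false);                      // don't build one giant tree
        parser.file();                                        // placeholder start rule
    }
}
```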

Edit with some additional commentary:

  • I put quite a bit of work into making sure UnbufferedCharStream and UnbufferedTokenStream operate in the most "sane" manner possible, especially in relation to the mark, release, seek, and getText methods. My goal was to preserve as much of the functionality of those methods as possible without compromising the ability of the stream to release unused memory.
  • ANTLR 4 allows for true unlimited lookahead. If your grammar requires lookahead to EOF to make a decision, then you would not be able to avoid loading the entire input into memory. You'll have to take great care to avoid this situation when writing your grammar.
Tati answered 10/5, 2013 at 22:22 Comment(5)
You can avoid having to modify your grammar to use skip if you subclass UnbufferedTokenStream to skip hidden tokens (which I did in the fill method). It would be nice if UnbufferedTokenStream provided a filtering option. – Tommy
It appears to be OK to use setBuildParseTree(true) if you are incrementally parsing top-level syntax directly using methods on the generated parser. – Tommy
Finally, this answer appears to apply only to the Java backend. I see no comparable classes in the Python API, for example. – Tommy
I thought that my ANTLR4 listener callbacks wouldn't be called if I setBuildParseTree(false)... although I haven't tried it yet. I switched to unbuffered streams and increased the VM heap to 8 GB with -Xmx8G, and this seemed to get me into larger files successfully. I also (unsuccessfully?) tried setTrimParseTree(true), as the comments in the source indicated this would prune the generated tree a bit. What I want is for ANTLR to behave as though it saw EOF at the end of each record, and to discard the parse tree data after calling my listener(s). – Closeknit
I tried those unbuffered streams: using only UnbufferedTokenStream improved performance. Also using UnbufferedCharStream led to an UnsupportedOperationException; to avoid it I added lexer.setTokenFactory(new CommonTokenFactory(true));. This, however, made performance worse. Any further hints appreciated. – Spoilt
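
On the listener question in the comments above: a listener registered with Parser.addParseListener is invoked during the parse itself, so its callbacks still fire with setBuildParseTree(false); only listeners walked over a finished tree need the tree to be kept. A hedged sketch, where MyParser, MyParserBaseListener, and the record rule are placeholder names for generated code:

```java
import org.antlr.v4.runtime.Token;

// Placeholder names: MyParserBaseListener and MyParser.RecordContext come from your grammar.
class RecordListener extends MyParserBaseListener {
    @Override
    public void exitRecord(MyParser.RecordContext ctx) {
        // With setBuildParseTree(false) the subtree under ctx is not retained,
        // so read from the boundary tokens rather than ctx.getText().
        Token start = ctx.getStart();
        Token stop = ctx.getStop();
        // ... write this record to the database here ...
    }
}

// Wiring (streams and parser set up as in the answer above):
// parser.setBuildParseTree(false);
// parser.addParseListener(new RecordListener());  // callbacks fire during parsing
// parser.file();                                   // placeholder start rule
```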

There is a Wiki page buried somewhere on Antlr.org that speaks to your question; I cannot seem to find it just now.

In substance, the lexer reads data using a standard InputStream interface, specifically ANTLRInputStream.java. The typical implementation is ANTLRFileStream.java, which preemptively reads the entire input data file into memory. What you need to do is write your own buffered version - "ANTLRBufferedFileStream.java" - that reads from the source file as needed. Or just set a standard BufferedInputStream/FileInputStream as the data source for the ANTLRInputStream.

One caveat is that Antlr4 has the potential for unbounded lookahead. That is not likely to be a problem with a reasonably sized buffer in normal operation; it is more likely when the parser attempts error recovery. Antlr4 allows the error recovery strategy to be tailored, so the problem is manageable.
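
As a hedged illustration of tailoring recovery (assuming the input is expected to be well formed): installing BailErrorStrategy makes the parser throw on the first syntax error instead of scanning ahead to resynchronize, which keeps recovery from dragging large amounts of input into the buffer. MyParser and the file rule are placeholders for generated code.

```java
import org.antlr.v4.runtime.BailErrorStrategy;
import org.antlr.v4.runtime.misc.ParseCancellationException;

final class BailParsing {
    // MyParser is a placeholder for the generated parser, already wired to
    // the unbuffered streams described above.
    static void parseOrBail(MyParser parser) {
        parser.setErrorHandler(new BailErrorStrategy()); // throw instead of recovering
        try {
            parser.file();                               // placeholder start rule
        } catch (ParseCancellationException e) {
            // Report and abort, or re-parse the offending region with
            // DefaultErrorStrategy if full diagnostics are needed.
            throw e;
        }
    }
}
```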

Additional detail:

In effect, Antlr implements a pull-parser. When you call the first parser rule, the parser requests tokens from the lexer, which requests character data from the input stream. The parser/lexer interface is implemented by a buffered token stream, nominally BufferedTokenStream.

The parse tree is little more than a tree data structure of tokens. Well, a lot more, but not in terms of data size. Each token is an INT value, typically backed by the fragment of the input data stream that matched the token definition. The lexer itself does not require a full copy of the lexed input character stream to be kept in memory, and the token text fragments could be zeroed out. The critical memory requirement for the lexer is the input character stream lookahead scan, given a buffered file input stream.

Depending on your needs, the in-memory parse tree can be small even given a 100GB+ input file.

To help further, you need to explain more about what you are trying to do in Antlr and what defines your minimum critical memory requirement. That will guide which additional strategies can be recommended. For example, if the source data is amenable, you can use multiple lexer/parser runs, each time sub-selecting in the lexer different portions of the source data to process. Compared to file reads and DB writes, even with fast disks, Antlr execution will likely be barely noticeable.

Paraformaldehyde answered 9/5, 2013 at 1:20 Comment(6)
I don't understand how reimplementing the stream can help. Doesn't ANTLR create the entire tree in memory? That is the problem, I think. – Farmstead
Can you confirm that your approach only helps free the memory used by the character stream, while the entire tree will still be in memory? – Farmstead
In the general case, yes, it will help. In your particular case, it is impossible to say whether it will or by how much, since you have provided no additional information. – Paraformaldehyde
My file is a very big XML file. I have already implemented an approach where another parser reads the top-level portions. My question persists: is it possible to purge tree nodes from memory? – Farmstead
If your file is really XML, then much of its bulk is wasted space in blanks and long, repeated tag names, and text versions of the values in your records (e.g. numbers and strings). A well-defined AST representing your file, with any bit of effort to represent the tags and textified values as reasonable binary data, is likely to have a memory footprint much smaller than the raw XML text. So your 100 GB XML file may go into 10 GB of VM space. Machines with 4-8 GB of RAM can virtually page 10 GB address spaces. Are you sure you can't just read this file with a standard XML reader? Why not? – Hispaniola
@GRosenBerg - I'm getting closer to understanding. I found "Start rules don't necessarily have to consume all input", and I understand that; what I don't understand is how to manage the char/token stream position pointers if I give ANTLR4 rules that consume portions of the input. I will need some recursion on a portion of my rule set, and I don't have a clear picture of how to manage token stream positions... yet. – Closeknit
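
On the stream-position question in the last comment, a hedged sketch of the record-at-a-time pattern: keep a single lexer and token stream over the whole file and call a per-record start rule in a loop; each call simply leaves the token stream positioned at the next record, so no manual seeking is needed. MyLexer, MyParser, and the record rule are placeholders, and this assumes the grammar has a rule matching exactly one record that does not reference EOF.

```java
import org.antlr.v4.runtime.*;
import java.io.FileInputStream;

public class RecordAtATime {
    public static void main(String[] args) throws Exception {
        CharStream chars = new UnbufferedCharStream(new FileInputStream(args[0]));
        MyLexer lexer = new MyLexer(chars);                   // placeholder generated lexer
        lexer.setTokenFactory(new CommonTokenFactory(true));  // copy token text eagerly
        TokenStream tokens = new UnbufferedTokenStream(lexer);

        MyParser parser = new MyParser(tokens);               // placeholder generated parser
        parser.setBuildParseTree(false);

        // Each call consumes one record's tokens; the stream then sits at the
        // start of the next record, and the finished context can be discarded.
        while (tokens.LA(1) != Token.EOF) {
            MyParser.RecordContext ctx = parser.record();     // placeholder per-record rule
            // process ctx.getStart()/ctx.getStop() here, then drop ctx
        }
    }
}
```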
