My ultimate goal is to parse a structured file as a tree of in-memory objects that I can then manipulate. The file format that I'm using is fairly sophisticated with about 200 keywords/tags, and this seemed like a good reason to learn about parser/lexer frameworks.
Unfortunately, there are so many concepts (and hundreds of tutorials and guides) that the learning process so far feels like trying to drink from a fire hose. So I'm taking some very meager baby steps, starting with this example.
I modified the grammar from that example to create the following test grammar, Nano.g4:
grammar Nano;
r : root ;                      // start rule: the whole input is a single root block
root : START ROOT ID END ROOT ; // e.g. "StartBlock RootItem foo EndBlock RootItem"
START : 'StartBlock' ;
END : 'EndBlock' ;
ROOT : 'RootItem' ;
ID : [a-z]+ ; // match lower-case identifiers
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
Next, I created a simple input file, nano.txt:
StartBlock RootItem
foo
EndBlock RootItem
I then generated, compiled, and ran the parser using the following commands:
del *.class
del *.java
java org.antlr.v4.Tool Nano.g4
javac Nano*.java
java org.antlr.v4.runtime.misc.TestRig Nano r -gui < nano.txt
That gives me this result:
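For reference, running the TestRig with -tree instead of -gui should print roughly the same tree in text form, something like:

(r (root StartBlock RootItem foo EndBlock RootItem))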
The tree above is my first conceptual hang-up about what to expect from a lexer and parser. The tokens behind "StartBlock RootItem" and "EndBlock RootItem" are necessary to make the input file legal, but conceptually I don't need them once I've verified that the file is properly formatted. The only thing I care about from that point on is that there is a RootItem containing "foo", as shown here:
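In other words, conceptually all I want to keep is something like this:

RootItem
 └── foo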
Again, I'm painfully new to parser/lexer concepts. Should I (or is it even possible to) write the grammar so that the resulting tree matches the simpler tree above? Or should I take care of that in a subsequent step that traverses the parse tree and extracts only the relevant data fields?
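For context, here is my rough, unverified sketch of what that subsequent traversal step might look like in Java (the NanoLexer, NanoParser, and NanoBaseListener names are the classes ANTLR generates for grammar Nano; I haven't confirmed that this compiles):

import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.*;

public class NanoMain {
    public static void main(String[] args) throws Exception {
        // Parse nano.txt the same way the TestRig does, starting from rule r.
        NanoLexer lexer = new NanoLexer(new ANTLRFileStream("nano.txt"));
        NanoParser parser = new NanoParser(new CommonTokenStream(lexer));
        ParseTree tree = parser.r();

        // Walk the full parse tree, but extract only the field I care about.
        ParseTreeWalker.DEFAULT.walk(new NanoBaseListener() {
            @Override
            public void enterRoot(NanoParser.RootContext ctx) {
                System.out.println("RootItem contains: " + ctx.ID().getText());
            }
        }, tree);
    }
}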