Parsing a "simple" grammar
Asked Answered
R

1

12

Sorry in advance; I am sure this question will seem almost idiotic to people who are used to playing with parsers and grammars, but these are foreign topics to me, and this is my attempt at stepping gently into a practical case requiring them.

I would like to write a parser for the following "language", which contains a single "special structure" looking like this:

\command[ options ]{ contents }

The contents can be anything, including nested commands, and may contain escaped brackets or backslashes \{ \} \\. I realise that "anything" is not specific, but ideally they should be determined by matching brackets (excluding escaped ones), if possible.

The options should be a comma-separated list of assignment expressions such as name = value, but the value may be a quoted string containing = or , characters. Finally the previous name and command should validate the regular expression \w[\w\d\._-+*]* -- that is, the first character should be a letter, and the remaining characters should be a letter, digit or one of . _ - + *.

Writing this with regular expressions seems overly complicated (for instance, because values may contain quoted characters , =, which would otherwise separate assignments or name/value pairs). So I think the most appropriate tool here is a grammar, but despite superficial readings, I am just not sure how to write it (BNF, PEG, etc?), which type of parsers to use (LR, recursive decent, etc?), and how I could use the parsing output in a practical program.

I would prefer answers with Python, which explains the tag, but of course I would be perfectly happy with a combination of tools if necessary/better suited.


NOTE: this is NOT about LaTeX. I realise the similarity of course, but LaTeX is extremely more complex than the previous language, for instance with character codes varying depending on the context. I am merely asking for a practical example which (I think) is simple enough for SO, and yet would already be useful to me in my daily work.

Revenuer answered 11/4, 2017 at 15:5 Comment(8)
Is this (La)TeX?Carnauba
No :) LaTeX is much more complicated, with character codes, @-statements, etc. In a way this is a very very strong restriction of LaTeX. I am asking mainly because I want to learn on a case which can already be useful at work, and which (I think) is simple enough for an answer on SO.Revenuer
So refreshing to read a question with parsing in the title that's actually about parsing.Confluent
"The contents can be anything" doesn't give much to go on from the point of view of writing a parser.Borscht
@JohnColeman What I mean is that the contents should be determined by matching bracket pairs, but ignoring escaped brackets. Does it need to be more specific for a grammar?Revenuer
If the contents have an internal structure then you need to spell it out a bit, especially if you are going to use a parser-generator tool which is expecting a grammar for input.Borscht
A long time ago I was looking at simpleparse.sourceforge.net to parse SQL. I gave up, because it a) wasn't high priority and b) it wasn't super well documented and I couldn't quickly figure how it would handle reserved words like AND, FROM, etc... But it did look fairly powerful nevertheless.Seabrooke
Could you post some example strings? For instance, it's not clear from your description if the "[]"s are actually in the input string, or just your own notation that these parts of the command are optional.Mellie
M
7

Start by expressing your grammar more formally, in whatever notation you prefer. e.g., from your description, an EBNF would be like this:

program := element+
element := command | literal
literal := (not '\')+

command := '\'identifier options? '{' program '}'
options := option | options ',' option
option  := identifier '=' value
value   := number | string

string  := '"' (escape | not '\' or '"')* '"'
escape  : = '\' char

Then either feed this to a parser generator (pyParsing, pyYACC, ANTLR) or write a parser by hand. In the latter case top-down is the simplest option: start from the top of the grammar and convert each rule to a function which will either return a parsed AST node and consume the input or return nothing or throw. Example:

 def program():
    elements = []
    while next_sym():
        elements.append(element())
    return {'type': 'program', 'children': elements}

 def element():
     return command() or literal()

 def command():
     if next_sym() == '\\':
         get_sym()
         ...parse command here
         return {'type': 'command', 'children': ...}
     return None 

where next_sym returns the next symbol from the input (or None on EOF) and get_sym consumes the symbol and advances the input buffer.

Murraymurre answered 12/4, 2017 at 8:30 Comment(2)
Thanks a lot; in this example, are identifier, number & char primitives, or should I define them as you defined the rest? Regarding the string definition, the part not '\' or '"' forces blackslashes and double quotes to be escaped, is that correct?Revenuer
Depends on which method you're using, some generators provide primitives for that, others not, so you'll have to define them using regexes.Murraymurre

© 2022 - 2024 — McMap. All rights reserved.