Parsing a "simple" grammar

Sorry in advance; I am sure this question will seem almost idiotic to people who are used to playing with parsers and grammars, but these are foreign topics to me, and this is my attempt at stepping gently into a practical case requiring them.

I would like to write a parser for the following "language", which contains a single "special structure" looking like this:

\command[ options ]{ contents }

The contents can be anything, including nested commands, and may contain escaped brackets or backslashes \{ \} \\. I realise that "anything" is not specific, but ideally they should be determined by matching brackets (excluding escaped ones), if possible.

The options should be a comma-separated list of assignment expressions such as name = value, but the value may be a quoted string containing = or , characters. Finally the previous name and command should validate the regular expression \w[\w\d\._-+*]* -- that is, the first character should be a letter, and the remaining characters should be a letter, digit or one of . _ - + *.

Writing this with regular expressions seems overly complicated (for instance, because values may contain quoted characters , =, which would otherwise separate assignments or name/value pairs). So I think the most appropriate tool here is a grammar, but despite superficial readings, I am just not sure how to write it (BNF, PEG, etc?), which type of parsers to use (LR, recursive decent, etc?), and how I could use the parsing output in a practical program.

I would prefer answers with Python, which explains the tag, but of course I would be perfectly happy with a combination of tools if necessary/better suited.

NOTE: this is NOT about LaTeX. I realise the similarity of course, but LaTeX is extremely more complex than the previous language, for instance with character codes varying depending on the context. I am merely asking for a practical example which (I think) is simple enough for SO, and yet would already be useful to me in my daily work.

program := element+ element := command | literal literal := (not '\')+ command := '\'identifier options? '{' program '}' options := option | options ',' option option := identifier '=' value value := number | string string := '"' (escape | not '\' or '"')* '"' escape : = '\' char

def program(): elements = [] while next_sym(): elements.append(element()) return {'type': 'program', 'children': elements} def element(): return command() or literal() def command(): if next_sym() == '\\': get_sym() ...parse command here return {'type': 'command', 'children': ...} return None

Recommended topics

Hot tags