Does Pyparsing Support Context-Sensitive Grammars?
Asked Answered
S

2

6

Forgive me if I have the incorrect terminology; perhaps just getting the "right" words to describe what I want is enough for me to find the answer on my own.

I am working on a parser for ODL (Object Description Language), an arcane language that as far as I can tell is now used only by NASA PDS (Planetary Data Systems; it's how NASA makes its data available to the public). Fortunately, PDS is finally moving to XML, but I still have to write software for a mission that fell just before the cutoff.

ODL defines objects in something like the following manner:

OBJECT              = TABLE
  ROWS              = 128
  ROW_BYTES         = 512 
END_OBJECT          = TABLE

I am attempting to write a parser with pyparsing, and I was doing fine right up until I came to the above construction.

I have to create some rule that is able to ensure that the right-hand-value of the OBJECT line is identical to the RHV of END_OBJECT. But I can't seem to put that into a pyparsing rule. I can ensure that both are syntactically valid values, but I can't go the extra step and ensure that the values are identical.

  1. Am I correct in my intuition that this is a context-sensitive grammar? Is that the phrase I should be using to describe this problem?
  2. Whatever kind of grammar this is in the theoretical sense, is pyparsing able to handle this kind of construction?
  3. If pyparsing is not able to handle it, is there another Python tool capable of doing so? How about ply (the Python implementation of lex/yacc)?
Sheathbill answered 27/2, 2013 at 2:23 Comment(0)
S
6

It is in fact a grammar for a context-sensitive language, classically abstracted as wcw where w is in (a|b)* (note that wcw' , where ' indicates reversal, is context-free).

Parsing Expression Grammars are capable of parsing wcw-type languages by using semantic predicates. PyParsing provides the matchPreviousExpr() and matchPreviousLiteral() helper methods for this very purpose, e.g.

w = Word("ab")
s = w + "c" + matchPreviousExpr(w)

So in your case you'd probably do something like

table_name = Word(alphas, alphanums)
object = Literal("OBJECT") + "=" + table_name + ... +
  Literal("END_OBJECT") + "=" +matchPreviousExpr(table_name)
Saveloy answered 27/2, 2013 at 4:23 Comment(0)
I
3

As a general rule, parsers are built as context-free parsing engines. If there is context sensitivity, it is grafted on after parsing (or at least after the relevant parsing steps are completed).

In your case, you want to write context-free grammar rules:

  head = 'OBJECT' '=' IDENTIFIER ;
  tail = 'END_OBJECT'  '=' IDENTIFIER ;
  element = IDENTIFIER '=' value ;
  element_list = element ;
  element_list = element_list element ;
  block = head element_list tail ;

The checks that the head and tail constructs have matching identifiers isn't technically done by the parser.

Many parsers, however, allow a semantic action to occur when a syntactic element is recognized, often for the purpose of building tree nodes. In your case, you want to use this to enable additional checking. For element, you want to make sure the IDENTIFIER isn't a duplicate of something already in the block; this means for each element encountered, you'll want to capture the corresponding IDENTIFIER and make a block-specific list to enable duplicate checking. For block, you want to capture the head *IDENTIFIER*, and check that it matches the tail *IDENTIFIER*.

This is easiest if you build a tree representing the parse as you go along, and hang the various context-sensitive values on the tree in various places (e.g., attach the actual IDENTIFIER value to the tree node for the head clause). At the point where you are building the tree node for the tail construct, it should be straightforward to walk up the tree, find the head tree, and then compare the identifiers.

This is easier to think about if you imagine the entire tree being built first, and then a post-processing pass over the tree is used to this checking. Lazy people in fact do it this way :-} All we are doing is pushing work that could be done in the post processing step, into the tree-building steps attached to the semantic actions.

None of these concepts is python specific, and the details for PyParsing will vary somewhat.

Incudes answered 27/2, 2013 at 3:47 Comment(4)
In pyparsing, you can attach a parse-time callback (we call it a "parse action") to the expression corresponding to block, and it can validate that the head and tail values agree, or raise a ParseException. So you don't have to do a lot of "tree-walking".Venita
So this is exactly the semantic action I discussed. Does it build the tree itself? How? If not, how is that you can check the value of the terminals?Incudes
pyparsing passes the parsed structure to the callback, and the callback in this case would look at the first and last elements to compare the object value strings. So I guess there is some tree-walking, but only over just the current object that was parsed. And with the addition of some labelling (similar to naming groups in a regular expression), the parse action could look directly at tokens.head.identifier and tokens.tail.identifier to validate that they match.Venita
After I posted this question I decided to use the strategy like above. My intuition tells me that since objects are never interleaved, only nested, the simple grammar would parse correctly and I could just assert that the values were the same as an afterthought. But I'm not sure I can prove that. (All the more motivation to go back to school, I suppose.)Sheathbill

© 2022 - 2024 — McMap. All rights reserved.