pypeg2 - can this expression be parsed using peg grammar?

I need to parse expressions based on the following rules:

  1. An expression can contain a filter object represented as name:value
  2. An expression can contain a string expression
  3. An expression can contain the Boolean operators OR and AND
  4. Everything inside can be quoted

So a typical expression looks like

filter1:45 hello world filter:5454

filter1:45 'hello world' filter:5454

hello world

'hello world' OR filter:43


Here's what I've tried so far:

class BooleanLiteral(Keyword):
    grammar = Enum(K("OR"), K("AND"))

class LineFilter(Namespace):
    grammar = flag('inverted', "-"), name(), ":", attr('value', word)

class LineExpression(List):
    grammar = csl(LineFilter, separator=blank)

With this grammar, I can parse strings like

filter2:32 filter1:3243

From what I understand, I can give the csl function a list of objects, and the input then has to follow the grammar in that order. But what if I want to parse an expression like

filter34:43 hello filter32:3232

or

filter34:43 OR filter32:3232

How can I say that an expression contains multiple types of objects (filters, string expressions, Booleans)? Is that possible with PEG?

Lafrance answered 26/11, 2015 at 10:25 Comment(5)
This can definitely be done - I just need to clarify a couple of things. Is the order of filters, Booleans and literals entirely unimportant, i.e. can you have any number of them in any order? Secondly, when you say anything can be quoted, does this mean that filters or Booleans can be quoted as well as string literals? Is the normal separator a space (it seems to be)? If spaces are the separator, are quoted strings with spaces counted as one token, or several? E.g. is hello world two tokens (hello and world), or one? How do you answer the same question for "hello world"? - Ellipsoid
Hi, thanks for your comment. 1) The only restriction that makes sense is that two Booleans cannot be next to each other; a Boolean needs to separate two operands: filter OR filter, expression OR expression, or expression OR filter. Otherwise there can be any number of filters and expressions, in no specified order. 2) String literals and Booleans can be quoted, filters cannot. 3) Space is the normal separator. 4) A quoted string is one token, so hello world is two tokens (hello and world) while "hello world" is the single token hello world. Did I answer all of your questions? - Lafrance
That all makes sense and answers the questions - I'll try to have a look at it when I can. - Ellipsoid
Ooh - one more thing occurs to me - are you on PyPEG 2.15? I presume so. - Ellipsoid
@JRichardSnape yep, it's version 2.15.0. - Lafrance
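
The Boolean placement rule agreed in these comments (a Boolean must sit between two operands, never next to another Boolean) can be sketched as a check over a sequence of token kinds. This is only an illustration in plain Python, not part of pypeg2; the function name booleans_well_placed and the kind labels are hypothetical.

```python
def booleans_well_placed(kinds, boolean_kind="boolean"):
    """Return True if every Boolean has a non-Boolean operand on both sides.

    `kinds` is a sequence of labels, one per token, in input order.
    The labels themselves are made up for this sketch.
    """
    for i, kind in enumerate(kinds):
        if kind != boolean_kind:
            continue
        # A Boolean needs a neighbour on each side, and neither may be a Boolean.
        has_left = i > 0 and kinds[i - 1] != boolean_kind
        has_right = i + 1 < len(kinds) and kinds[i + 1] != boolean_kind
        if not (has_left and has_right):
            return False
    return True


# filter OR expression: allowed by the rule
print(booleans_well_placed(["filter", "boolean", "string"]))
# expression OR AND filter: two Booleans next to each other
print(booleans_well_placed(["string", "boolean", "boolean", "filter"]))
```

The label for each token could be, for example, the class name of each parsed object, passed in with a matching boolean_kind.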

From your spec in the question and comments, I think your code is close, but you don't want the csl. I've put the code I think you want below (it may not be the most elegant implementation, but I think it's reasonable). You have to avoid a potential problem: anything that matches BooleanLiteral also matches StringLiteral. This means that you can't simply give LineExpression the grammar

grammar = maybe_some([LineFilter,StringLiteral]), optional(BooleanLiteral)

The result is a list of objects with the correct types according to your spec, I think. The crucial bit to emphasise is that you can put in alternatives as a Python list (i.e. [LineFilter,StringLiteral] means a LineFilter or a StringLiteral). The PEG parser tries them in the order they occur: it tries to match the first, and only if that fails does it try the second, and so on.
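
To make the ordered-choice idea concrete, here is a rough sketch that does not use pypeg2 at all: it is plain `re` from the standard library, and the names ALTERNATIVES and tokenize, as well as the patterns, are made up for the illustration. It only shows why the order of alternatives matters; it is not how pypeg2 is implemented.

```python
import re

# Hypothetical token patterns, tried strictly in this order (ordered choice).
# "string" comes last because it would also match OR / AND as plain words.
ALTERNATIVES = [
    ("filter", re.compile(r'-?\w+:\w+')),
    ("boolean", re.compile(r'"?(?:OR|AND)"?(?!\w)')),
    ("string", re.compile(r'"[^"]*"|\w+')),
]


def tokenize(text):
    """Split text into (kind, value) pairs using the first alternative that matches."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # whitespace only separates tokens
            continue
        for kind, pattern in ALTERNATIVES:
            match = pattern.match(text, pos)
            if match:
                tokens.append((kind, match.group()))
                pos = match.end()
                break  # first success wins; later alternatives are not tried
        else:
            raise ValueError(f"cannot parse at position {pos}: {text[pos:]!r}")
    return tokens


for kind, value in tokenize('filter34:43 "My oh my!!" Hello OR filter32:3232'):
    print(kind, value)
```

If the "string" entry were moved ahead of "boolean", OR would be reported as a string, which is the same pitfall the grammar below has to avoid.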

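import re

from pypeg2 import *

class BooleanLiteral(Keyword):
    # Widen the keyword regex so quoted keywords such as "OR" also match.
    # Note: this assigns to K.regex, so it changes K itself, not just this class.
    K.regex = re.compile(r'"*\w+"*')
    grammar = Enum(K('OR'), K('AND'), K(r'"OR"'), K(r'"AND"'))

class LineFilter(Namespace):
    # An optional leading "-" sets the 'inverted' flag, followed by name:value
    grammar = flag('inverted', "-"), name(), ":", attr('value', word)

class StringLiteral(str):
    # Either a bare word or a double-quoted string (one token, spaces allowed)
    quoted_string = re.compile(r'"[^"]*"')
    grammar = [word, quoted_string]

class LineExpression(List):
    # Alternatives are tried in order: the pairs ending in a BooleanLiteral
    # come first so that a keyword is not consumed as a StringLiteral.
    grammar = maybe_some([(LineFilter, BooleanLiteral),
                          (StringLiteral, BooleanLiteral),
                          LineFilter,
                          StringLiteral])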

test_string = ('filter34:43 "My oh my!!" Hello OR '
               'filter32:3232 "AND" "Goodbye cruel world"')

k = parse(test_string,LineExpression)

print('Input:')
print(test_string)
print('Parsed output:')
print('==============')
for component in k:
    print(component,type(component))

Output

Input:
filter34:43 "My oh my!!" Hello OR filter32:3232 "AND" "Goodbye cruel world"
Parsed output:
==============
LineFilter([], name=Symbol('filter34')) <class '__main__.LineFilter'>
"My oh my!!" <class '__main__.StringLiteral'>
Hello <class '__main__.StringLiteral'>
OR <class '__main__.BooleanLiteral'>
LineFilter([], name=Symbol('filter32')) <class '__main__.LineFilter'>
"AND" <class '__main__.BooleanLiteral'>
"Goodbye cruel world" <class '__main__.StringLiteral'>
Ellipsoid answered 7/12, 2015 at 12:41 Comment(0)
