Simply using parsec in python

I'm looking at this library, which has little documentation: https://pythonhosted.org/parsec/#examples

I understand there are alternatives, but I'd like to use this library.

I have the following string I'd like to parse:

mystr = """
<kv>
  key1: "string"
  key2: 1.00005
  key3: [1,2,3]
</kv>
<csv>
date,windspeed,direction
20190805,22,NNW
20190805,23,NW
20190805,20,NE
</csv>"""

While I'd like to parse the whole thing, I'd settle for just grabbing the <tags>. I have:

>>> import parsec
>>> tag_start = parsec.Parser(lambda x: x == "<")
>>> tag_end = parsec.Parser(lambda x: x == ">")
>>> tag_name = parsec.Parser(parsec.Parser.compose(parsec.many1, parsec.letter))
>>> tag_open = parsec.Parser(parsec.Parser.joint(tag_start, tag_name, tag_end))

OK, looks good. Now to use it:

>>> tag_open.parse(mystr)
Traceback (most recent call last):
...
TypeError: <lambda>() takes 1 positional argument but 2 were given

This fails. I don't understand why the error says my lambda was given two arguments when it clearly takes one. How can I proceed?

My optimal desired output for all the bonus points is:

[
{"type": "tag", 
 "name" : "kv",
 "values"  : [
    {"key1" : "string"},
    {"key2" : 1.00005},
    {"key3" : [1,2,3]}
  ]
},
{"type" : "tag",
"name" : "csv", 
"values" : [
    {"date" : 20190805, "windspeed" : 22, "direction": "NNW"},
    {"date" : 20190805, "windspeed" : 23, "direction": "NW"},
    {"date" : 20190805, "windspeed" : 20, "direction": "NE"}
  ]
}
]

For this question, I'd settle for understanding how to use functions like the ones above for start and end tags to generate:

[
  {"tag": "kv"},
  {"tag" : "csv"}
]

In other words, I'd simply like to be able to parse arbitrary XML-like tags out of the messy mixed text.

Gorgias answered 6/8, 2019 at 4:18 Comment(3)
For starters, the lambda needs to take two arguments and return Value.success() or Value.failure() as mentioned here. – Resemblance
But fn must accept two arguments. It is passed the entire text and current index. – Inaugural
Here you see examples of the usage: github.com/sighingnow/parsec.py/blob/master/tests/… – Inaugural
H
19

I encourage you to build your parser from the library's combinators rather than constructing a Parser directly.

If you do want to construct a Parser by wrapping a function then, as the documentation states, fn must accept two arguments: the first is the text and the second is the current position. fn must also return a Value, created with Value.success or Value.failure, rather than a boolean. You can grep for @Parser in parsec/__init__.py in this package to find more examples of how it works.

For your case in the description, you could define the parser as follows:

import re  # re.MULTILINE is used below; import it explicitly instead of relying on the star import

from parsec import *

spaces = regex(r'\s*', re.MULTILINE)
name = regex(r'[_a-zA-Z][_a-zA-Z0-9]*')

tag_start = spaces >> string('<') >> name << string('>') << spaces
tag_stop = spaces >> string('</') >> name << string('>') << spaces

@generate
def header_kv():
    key = yield spaces >> name << spaces
    yield string(':')
    value = yield spaces >> regex('[^\n]+')
    return {key: value}

@generate
def header():
    tag_name = yield tag_start
    values = yield sepBy(header_kv, string('\n'))
    tag_name_end = yield tag_stop
    assert tag_name == tag_name_end
    return {
        'type': 'tag',
        'name': tag_name,
        'values': values
    }

@generate
def body():
    tag_name = yield tag_start
    values = yield sepBy(sepBy1(regex(r'[^\n<,]+'), string(',')), string('\n'))
    tag_name_end = yield tag_stop
    assert tag_name == tag_name_end
    return {
        'type': 'tag',
        'name': tag_name,
        'values': values
    }

parser = header + body

If you run parser.parse(mystr), it yields

({'type': 'tag',
  'name': 'kv',
  'values': [{'key1': '"string"'},
             {'key2': '1.00005'},
             {'key3': '[1,2,3]'}]},
 {'type': 'tag',
  'name': 'csv',
  'values': [['date', 'windspeed', 'direction'],
             ['20190805', '22', 'NNW'],
             ['20190805', '23', 'NW'],
             ['20190805', '20', 'NE']]}
)

You can refine the definition of values in the above code to get the result in the exact form you want.
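
One way to do that refinement is as a plain post-processing step on the parser's output, outside parsec. The sketch below assumes the target is the shape sketched in the question; the helper names convert_scalar, refine_kv and refine_csv are hypothetical, and ast.literal_eval is used only because the sample values happen to be valid Python literals.

```python
import ast

def convert_scalar(text):
    """Turn '1.00005', '"string"' or '[1,2,3]' into a Python value.

    Falls back to the stripped text when it is not a Python literal (e.g. NNW).
    """
    text = text.strip()
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text

def refine_kv(values):
    """Convert the value of every single-key dict produced for the kv section."""
    return [{k: convert_scalar(v) for k, v in pair.items()} for pair in values]

def refine_csv(rows):
    """Treat the first row as the header and turn each later row into a dict."""
    header, *records = rows
    return [dict(zip(header, map(convert_scalar, rec))) for rec in records]

# Sample data copied from the parser output shown above (csv rows shortened).
kv_values = [{'key1': '"string"'}, {'key2': '1.00005'}, {'key3': '[1,2,3]'}]
csv_values = [['date', 'windspeed', 'direction'],
              ['20190805', '22', 'NNW'],
              ['20190805', '23', 'NW']]

print(refine_kv(kv_values))
print(refine_csv(csv_values))
```

The same conversions could instead be attached to the individual parsers, but keeping them separate leaves the grammar itself unchanged.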

Herbivore answered 13/8, 2019 at 7:31 Comment(0)
I
8

According to the tests, the proper way to parse your string would be the following:

from parsec import *

possible_chars = letter() | space() | one_of('/.,:"[]') | digit()
parser = many(many(possible_chars) + string("<") >> mark(many(possible_chars)) << string(">"))

parser.parse(mystr)
# [((1, 1), ['k', 'v'], (1, 3)), ((5, 1), ['/', 'k', 'v'], (5, 4)), ((6, 1), ['c', 's', 'v'], (6, 4)), ((11, 1), ['/', 'c', 's', 'v'], (11, 5))]

The construction of the parser:


For the sake of convenience, we first define the characters we wish to match. parsec provides several primitive parsers:

  • letter(): matches any alphabetic character,

  • string(str): matches any specified string str,

  • space(): matches any whitespace character,

  • spaces(): matches multiple whitespace characters,

  • digit(): matches any digit,

  • eof(): matches the end of the input string,

  • regex(pattern): matches a provided regex pattern,

  • one_of(str): matches any character from the provided string,

  • none_of(str): matches any character that is not in the provided string.


We can combine them with operators, according to the docs:

  • |: Choice. The parser p | q first applies p. If it succeeds, the value of p is returned. If p fails without consuming any input, parser q is tried. Note that this operator does not backtrack,

  • +: Joins two or more parsers into one and returns the aggregate of their results,

  • ^: Choice with backtracking, used whenever arbitrary lookahead is needed. The parser p ^ q first applies p. If it succeeds, the value of p is returned. If p fails, it pretends that it hasn't consumed any input, and then parser q is tried,

  • <<: p << q applies p and then q, consumes the input matched by q, and returns the value of p,

  • <: p < q applies p and then checks that q matches, without consuming the input matched by q; it returns the value of p,

  • >>: Sequentially composes two parsers, discarding any value produced by the first,

  • mark(p): Marks the line and column information of the result of the parser p.


Then there are multiple "combinators":

  • times(p, mint, maxt=None): Repeats parser p from mint to maxt times,

  • count(p, n): Repeats parser p exactly n times. If n is less than or equal to zero, the parser simply returns an empty list,

  • optional(p, default_value=None): Makes a parser optional. On success it returns the result; otherwise it silently returns default_value without raising any exception. If default_value is not provided, None is returned instead,

  • many(p): Repeats parser p zero or more times,

  • many1(p): Repeat parser p at least once,

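  • separated(p, sep, mint, maxt=None, end=None): repeats parser p, separated by sep, between mint and maxt times; end controls how a trailing separator is handled,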

  • sepBy(p, sep): parses zero or more occurrences of parser p, separated by delimiter sep,

  • sepBy1(p, sep): parses at least one occurrence of parser p, separated by delimiter sep,

  • endBy(p, sep): parses zero or more occurrences of p, separated and ended by sep,

  • endBy1(p, sep): parses at least one occurrence of p, separated and ended by sep,

  • sepEndBy(p, sep): parses zero or more occurrences of p, separated and optionally ended by sep,

  • sepEndBy1(p, sep): parses at least one occurrence of p, separated and optionally ended by sep.


Putting that together, the parser repeatedly skips any run of possible_chars, matches a <, and then marks the run of possible_chars that follows, up to the closing >.
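
To get from that marked output to the minimal [{"tag": ...}] form the question asks for, a small post-processing step in plain Python is enough. This is only a sketch: opening_tags is a hypothetical helper, and the marked list is copied from the output shown earlier in this answer.

```python
# Output copied from the parser.parse(mystr) call shown earlier.
marked = [((1, 1), ['k', 'v'], (1, 3)),
          ((5, 1), ['/', 'k', 'v'], (5, 4)),
          ((6, 1), ['c', 's', 'v'], (6, 4)),
          ((11, 1), ['/', 'c', 's', 'v'], (11, 5))]

def opening_tags(marks):
    """Join each character list into a name and keep only the opening tags."""
    names = (''.join(chars) for _start, chars, _end in marks)
    return [{'tag': name} for name in names if not name.startswith('/')]

print(opening_tags(marked))
```

The start and end positions from mark() are ignored here, but they could be kept if the tag locations matter.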

Inaugural answered 12/8, 2019 at 19:10 Comment(2)
Thank you, this is very thorough and helpful. Can I ask you two follow-ups? 1. The import * seems non-idiomatic. Do you think this is the only sensible way to use this library all the same? 2. This looks like a great extraction. How would you use this library to make a dictionary-like object like the parse I tried to sketch out? – Gorgias
Thanks for this answer. Is there documentation with a similar explanation? Could you provide an example for each? Also, why do you talk about A || B in the ^ bullet point? – Combine
S
4

As others noted, the parse function needs to accept two arguments.
The syntax for multiple input args is: lambda x, y: ...

Unfortunately, a lambda is not well suited to building a parsec Parser this way: you need to return a parsec.Value rather than a boolean, so the lambda quickly loses its terseness.

The design of parsec requires a Parser to act independently on an input stream, without knowledge of any other Parser. To do this effectively, a Parser must manage an index position in the input string: it receives the starting index and returns the next position after consuming some tokens. This is why a parsec.Value (carrying a success flag and the output index) is returned, and why an input index is required along with the input string.

Here's a basic example consuming a < token, to illustrate:

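import parsec

def parse_start_tag(stream, index):
    # parsec passes the whole text plus the index at which to start parsing.
    if index < len(stream) and stream[index] == '<':
        # Consume one character; the rest of the text is returned as the value.
        return parsec.Value.success(index + 1, stream[index + 1:])
    else:
        return parsec.Value.failure(index, '<')

tag_open = parsec.Parser(parse_start_tag)
print(tag_open.parse("<tag>")) # prints: tag>
print(tag_open.parse("tag>"))  # fails: raises ParseError (expected <)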
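
To make the index-threading idea concrete without depending on parsec's internals, here is a minimal stand-alone sketch of the same protocol. It is not parsec's implementation: Result, char and then are hypothetical names, and Result only mimics the role that parsec.Value plays.

```python
from collections import namedtuple

# Stand-in for parsec.Value: success flag, next index, parsed value, expectation.
Result = namedtuple('Result', 'status index value expected')

def char(c):
    """Build a parser function (text, index) -> Result that matches one character."""
    def parser(text, index):
        if index < len(text) and text[index] == c:
            return Result(True, index + 1, c, None)
        return Result(False, index, None, c)
    return parser

def then(first, second):
    """Run `first`, then start `second` at the index that `first` returned."""
    def parser(text, index):
        r1 = first(text, index)
        if not r1.status:
            return r1
        r2 = second(text, r1.index)
        if not r2.status:
            return r2
        return Result(True, r2.index, r1.value + r2.value, None)
    return parser

open_k = then(char('<'), char('k'))
print(open_k('<kv>', 0))   # success: the index has advanced past both characters
print(open_k('kv>', 0))    # failure: reports what was expected and where
```

Because each parser only sees the text and a starting index, the two char parsers need no knowledge of each other; then simply hands the first parser's output index to the second.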
Siskin answered 11/8, 2019 at 22:1 Comment(0)
P
3

Since the parser requires a function that has two alternative results (and two parameters), you may want to write the function argument as a separate named function rather than trying to do it with an inline function definition (lambda).

A Parser is an object that wraps a function to do the parsing work. Arguments of the function should be a string to be parsed and the index on which to begin parsing. The function should return either Value.success(next_index, value) if parsing successfully, or Value.failure(index, expected) on the failure.

If you want to use a lambda expression anyway, it has to take both required parameters, for example something like the following (I haven't read through the docs closely enough to be sure exactly how Value.success and Value.failure are expected to be used):

lambda x, y: Value.success(y + 1, x[y]) if x[y:y + 1] == "<" else Value.failure(y, "<")
Paronychia answered 11/8, 2019 at 21:58 Comment(0)
