Python parsing bracketed blocks
Asked Answered
R

9

35

What would be the best way in Python to parse out chunks of text contained in matching brackets?

"{ { a } { b } { { { c } } } }"

should initially return:

[ "{ a } { b } { { { c } } }" ]

putting that as an input should return:

[ "a", "b", "{ { c } }" ]

which should return:

[ "{ c }" ]

[ "c" ]

[]
Reluct answered 30/10, 2009 at 18:18 Comment(4)
I'm curious. In your last example, would it return [ "{ c }" ] and then that would return [ "c" ] and then that would return [ ]?Dorcia
Shouldn't second statement be ["{ a } { b } { { { c } } }"] instead?Topheavy
The second statement is [ "{ a } { b } { { { c } } }" ], exactly what you wrote... I'm going to complete the entire sequence so people know exactly what I wantReluct
I think you want [ "{ a } { b } { { { c } } }" ] to go to [ "a", "b", "{ { c } }" ], am i right?Cocteau
C
32

Pseudocode:

For each string in the array:
    Find the first '{'. If there is none, leave that string alone.
    Init a counter to 0. 
    For each character in the string:  
        If you see a '{', increment the counter.
        If you see a '}', decrement the counter.
        If the counter reaches 0, break.
    Here, if your counter is not 0, you have invalid input (unbalanced brackets)
    If it is, then take the string from the first '{' up to the '}' that put the
     counter at 0, and that is a new element in your array.
Cocteau answered 30/10, 2009 at 18:35 Comment(6)
Wonderful! I had the designs of something like this in my head, but not quite so well formulated. Out of interest, is there a formal name for this kind of parser? I'm familiar with recursive descent parsers from a module we did at uni, but not this kind.Reluct
I'm not sure if it has a name. This might just be a short-circuited way of doing parens matching, without doing a full-blown recursive descent parser. Maybe someone else knows better.Cocteau
To add: the counter represents how deep in the 'recursion' you are, without having to do function calls.Cocteau
Your indentation is off. As written, with the input "{a}{b}" you will return "a{b}" and not "ab". The "Here" part needs to be indented. This is Python and indentation counts.Quatre
No, I think his indentation is correct. If the loop terminates with the counter equal to zero, then all is fine and a matched bracket pair has been found. However if it terminates with counter != zero then the loop simply terminated because there were no more characters, and thus the brackets were unmatched.Reluct
Shouldn't the last statement (which adds the new element in the array) be looped ? That way, all the children of the same node will be added at once ?Redaredact
H
56

Or this pyparsing version:

>>> from pyparsing import nestedExpr
>>> txt = "{ { a } { b } { { { c } } } }"
>>>
>>> nestedExpr('{','}').parseString(txt).asList()
[[['a'], ['b'], [[['c']]]]]
>>>
Horick answered 31/10, 2009 at 1:50 Comment(1)
This is far more readable than the algorithm in the accepted answer, and I also trust a widely used (and therefore tested) library over my own home grown solution. One very important thing to note: if your entire expression is not enclosed by a pair of grouping symbols, then this will only process the first grouped expression. If you always want to process the entire expression, you may be able to force this by adding an outer pair of grouping symbols when they aren't already present.Mcquoid
C
32

Pseudocode:

For each string in the array:
    Find the first '{'. If there is none, leave that string alone.
    Init a counter to 0. 
    For each character in the string:  
        If you see a '{', increment the counter.
        If you see a '}', decrement the counter.
        If the counter reaches 0, break.
    Here, if your counter is not 0, you have invalid input (unbalanced brackets)
    If it is, then take the string from the first '{' up to the '}' that put the
     counter at 0, and that is a new element in your array.
Cocteau answered 30/10, 2009 at 18:35 Comment(6)
Wonderful! I had the designs of something like this in my head, but not quite so well formulated. Out of interest, is there a formal name for this kind of parser? I'm familiar with recursive descent parsers from a module we did at uni, but not this kind.Reluct
I'm not sure if it has a name. This might just be a short-circuited way of doing parens matching, without doing a full-blown recursive descent parser. Maybe someone else knows better.Cocteau
To add: the counter represents how deep in the 'recursion' you are, without having to do function calls.Cocteau
Your indentation is off. As written, with the input "{a}{b}" you will return "a{b}" and not "ab". The "Here" part needs to be indented. This is Python and indentation counts.Quatre
No, I think his indentation is correct. If the loop terminates with the counter equal to zero, then all is fine and a matched bracket pair has been found. However if it terminates with counter != zero then the loop simply terminated because there were no more characters, and thus the brackets were unmatched.Reluct
Shouldn't the last statement (which adds the new element in the array) be looped ? That way, all the children of the same node will be added at once ?Redaredact
S
7

I'm kind of new to Python, so go easy on me, but here is an implementation that works:

def balanced_braces(args):
    parts = []
    for arg in args:
        if '{' not in arg:
            continue
        chars = []
        n = 0
        for c in arg:
            if c == '{':
                if n > 0:
                    chars.append(c)
                n += 1
            elif c == '}':
                n -= 1
                if n > 0:
                    chars.append(c)
                elif n == 0:
                    parts.append(''.join(chars).lstrip().rstrip())
                    chars = []
            elif n > 0:
                chars.append(c)
    return parts

t1 = balanced_braces(["{{ a } { b } { { { c } } } }"]);
print t1
t2 = balanced_braces(t1)
print t2
t3 = balanced_braces(t2)
print t3
t4 = balanced_braces(t3)
print t4

Output:

['{ a } { b } { { { c } } }']
['a', 'b', '{ { c } }']
['{ c }']
['c']
Sandy answered 30/10, 2009 at 19:26 Comment(0)
T
5

Parse using lepl (installable via $ easy_install lepl):

from lepl import Any, Delayed, Node, Space

expr = Delayed()
expr += '{' / (Any() | expr[1:,Space()[:]]) / '}' > Node

print expr.parse("{{a}{b}{{{c}}}}")[0]

Output:

Node
 +- '{'
 +- Node
 |   +- '{'
 |   +- 'a'
 |   `- '}'
 +- Node
 |   +- '{'
 |   +- 'b'
 |   `- '}'
 +- Node
 |   +- '{'
 |   +- Node
 |   |   +- '{'
 |   |   +- Node
 |   |   |   +- '{'
 |   |   |   +- 'c'
 |   |   |   `- '}'
 |   |   `- '}'
 |   `- '}'
 `- '}'
Tidemark answered 30/10, 2009 at 23:54 Comment(0)
V
3

Cleaner solution. This will find return the string enclosed in the outermost bracket. If None is returned, there was no match.

def findBrackets( aString ):
   if '{' in aString:
      match = aString.split('{',1)[1]
      open = 1
      for index in xrange(len(match)):
         if match[index] in '{}':
            open = (open + 1) if match[index] == '{' else (open - 1)
         if not open:
            return match[:index]
Virendra answered 6/5, 2010 at 10:51 Comment(0)
R
2

You could also parse them all at once, though I find the {a} to mean "a" rather than ["a"] slightly weird. If I've understood the format correctly:

import re
import sys


_mbrack_rb = re.compile("([^{}]*)}") # re.match doesn't have a pos parameter
def mbrack(s):
  """Parse matching brackets.

  >>> mbrack("{a}")
  'a'
  >>> mbrack("{{a}{b}}")
  ['a', 'b']
  >>> mbrack("{{a}{b}{{{c}}}}")
  ['a', 'b', [['c']]]

  >>> mbrack("a")
  Traceback (most recent call last):
  ValueError: expected left bracket
  >>> mbrack("{a}{b}")
  Traceback (most recent call last):
  ValueError: more than one root
  >>> mbrack("{a")
  Traceback (most recent call last):
  ValueError: expected value then right bracket
  >>> mbrack("{a{}}")
  Traceback (most recent call last):
  ValueError: expected value then right bracket
  >>> mbrack("{a}}")
  Traceback (most recent call last):
  ValueError: unbalanced brackets (found right bracket)
  >>> mbrack("{{a}")
  Traceback (most recent call last):
  ValueError: unbalanced brackets (not enough right brackets)
  """
  stack = [[]]
  i, end = 0, len(s)
  while i < end:
    if s[i] != "{":
      raise ValueError("expected left bracket")
    elif i != 0 and len(stack) == 1:
      raise ValueError("more than one root")
    while i < end and s[i] == "{":
      L = []
      stack[-1].append(L)
      stack.append(L)
      i += 1
    stack.pop()
    stack[-1].pop()
    m = _mbrack_rb.match(s, i)
    if m is None:
      raise ValueError("expected value then right bracket")
    stack[-1].append(m.group(1))
    i = m.end(0)
    while i < end and s[i] == "}":
      if len(stack) == 1:
        raise ValueError("unbalanced brackets (found right bracket)")
      stack.pop()
      i += 1
  if len(stack) != 1:
    raise ValueError("unbalanced brackets (not enough right brackets)")
  return stack[0][0]


def main(args):
  if args:
    print >>sys.stderr, "unexpected arguments: %r" % args
  import doctest
  r = doctest.testmod()
  print r
  return r[0]

if __name__ == "__main__":
  sys.exit(main(sys.argv[1:]))
Reachmedown answered 30/10, 2009 at 19:42 Comment(0)
G
2

If you want to use a parser (lepl in this case), but still want the intermediate results rather than a final parsed list, then I think this is the kind of thing you were looking for:

>>> nested = Delayed()
>>> nested += "{" + (nested[1:,...]|Any()) + "}"
>>> split = (Drop("{") & (nested[:,...]|Any()) & Drop("}"))[:].parse
>>> split("{{a}{b}{{{c}}}}")
['{a}{b}{{{c}}}']
>>> split("{a}{b}{{{c}}}")
['a', 'b', '{{c}}']
>>> split("{{c}}")
['{c}']
>>> split("{c}")
['c']

That might look opaque at first, but it's fairly simple really :o)

nested is a recursive definition of a matcher for nested brackets (the "+" and [...] in the definition keep everything as a single string after it has been matched). Then split says match as many as possible ("[:]") of something that is surrounded by "{" ... "}" (which we discard with "Drop") and contains either a nested expression or any letter.

Finally, here's a lepl version of the "all in one" parser that gives a result in the same format as the pyparsing example above, but which (I believe) is more flexible about how spaces appear in the input:

>>> with Separator(~Space()[:]):
...     nested = Delayed()
...     nested += Drop("{") & (nested[1:] | Any()) & Drop("}") > list
...
>>> nested.parse("{{ a }{ b}{{{c}}}}")
[[['a'], ['b'], [[['c']]]]]
Goldenseal answered 31/10, 2009 at 12:38 Comment(0)
T
1

Using Grako (grammar compiler):

#!/usr/bin/env python
import json
import grako # $ pip install grako

grammar_ebnf = """
    bracketed = '{' @:( { bracketed }+ | any ) '}' ;
    any = /[^{}]+?/ ;
"""
model = grako.genmodel("Bracketed", grammar_ebnf)
ast = model.parse("{ { a } { b } { { { c } } } }", "bracketed")
print(json.dumps(ast, indent=4))

Output

[
    "a", 
    "b", 
    [
        [
            "c"
        ]
    ]
]
Tidemark answered 5/10, 2014 at 18:17 Comment(0)
D
1

Here is a solution I came up with for a similar use case. This was loosely based on the accepted psuedo code answer. I didn't want to add any dependencies for external libraries:

def parse_segments(source, recurse=False):
    """
    extract any substring enclosed in parenthesis
    source should be a string
    """
    unmatched_count = 0
    start_pos = 0
    opened = False
    open_pos = 0
    cur_pos = 0

    finished = []
    segments = []

    for character in source:
        #scan for mismatched parenthesis:
        if character == '(':
            unmatched_count += 1
            if not opened:
                open_pos = cur_pos
            opened = True

        if character == ')':
            unmatched_count -= 1

        if opened and unmatched_count == 0:
            segment = source[open_pos:cur_pos+1]
            segments.append(segment)
            clean = source[start_pos:open_pos]
            if clean:
                finished.append(clean)
            opened = False
            start_pos = cur_pos+1

        cur_pos += 1

    assert unmatched_count == 0

    if start_pos != cur_pos:
        #get anything that was left over here
        finished.append(source[start_pos:cur_pos])

    #now check on recursion:
    for item in segments:
        #get rid of bounding parentheses:
        pruned = item[1:-1]
        if recurse:
            results = parse_tags(pruned, recurse)
            finished.expand(results)
        else:
            finished.append(pruned)

    return finished
Dimmick answered 8/12, 2014 at 15:22 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.