How can I split a string of a mathematical expressions in python?

Asked 13/4, 2017 at 10:22 Answered 14/4, 2020 at 1:52

Solved python string python-3.x split tokenize

I made a program which convert infix to postfix in python. The problem is when I introduce the arguments. If i introduce something like this: (this will be a string)

( ( 73 + ( ( 34 - 72 ) / ( 33 - 3 ) ) ) + ( 56 + ( 95 - 28 ) ) )

it will split it with .split() and the program will work correctly. But I want the user to be able to introduce something like this:

((73 + ( (34- 72 ) / ( 33 -3) )) + (56 +(95 - 28) ) )

As you can see I want that the blank spaces can be trivial but the program continue splitting the string by parentheses, integers (not digits) and operands.

I try to solve it with a for but I don't know how to catch the whole number (73 , 34 ,72) instead one digit by digit (7, 3 , 3 , 4 , 7 , 2)

To sum up, what I want is split a string like ((81 * 6) /42+ (3-1)) into:

[(, (, 81, *, 6, ), /, 42, +, (, 3, -, 1, ), )]

Apophasis answered 13/4, 2017 at 10:22 Comment(8)

The word for what you're doing is "tokenizing" - I've replaced your list tag with tokenize. – Threw 13/4, 2017 at 10:25

regex isn't that good for nested brackets. grako is good but may be heavyweight if this is all you need to do. I like grako because you end up with readable code. – Algin 13/4, 2017 at 10:31

Thanks for the question, it helped me reach the silver Python badge! – Simson 13/4, 2017 at 14:28

Have a look at sympy's parsing – Mitman 13/4, 2017 at 14:29

@MichaelGrazebrook It's still reasonable to use regex to get the tokens, and then you can validate the brackets in a later stage (maybe with a parser library, maybe with a hand-written recursive descent or shunting yard parser) – Omen 13/4, 2017 at 15:0

@Omen A regex is OK for a single pair of brackets but quickly becomes unreadable for a general case. grako would be over-engineered for the simple case but for a general expression grammar is excellent. It's also readable. – Algin 17/4, 2017 at 8:30

@MichaelGrazebrook You're misunderstanding the difference between tokenizing and parsing. Tokenizing doesn't deal with checking whether the brackets are balanced (or anything else that doesn't make sense like having two operators next to each other, an operator before a closed bracket, etc), it's just separating so that a single bracket counts as a 'word', which is all that the question is asking for. – Omen 17/4, 2017 at 13:27

@Omen No misunderstanding. Although Fernaku's question can be answered in terms of pure tokenising, the context suggests balancing brackets is relevant . We agree parsing is good for that. grako includes tokenising in its terminal expressions. – Algin 19/4, 2017 at 13:13

Tree with `ast`

You could use ast to get a tree of the expression :

import ast

source = '((81 * 6) /42+ (3-1))'
node = ast.parse(source) 

def show_children(node, level=0):
    if isinstance(node, ast.Num):
        print(' ' * level + str(node.n))
    else:
        print(' ' * level + str(node))
    for child in ast.iter_child_nodes(node):
        show_children(child, level+1)

show_children(node)

It outputs :

<_ast.Module object at 0x7f56abbc5490>
 <_ast.Expr object at 0x7f56abbc5350>
  <_ast.BinOp object at 0x7f56abbc5450>
   <_ast.BinOp object at 0x7f56abbc5390>
    <_ast.BinOp object at 0x7f56abb57cd0>
     81
     <_ast.Mult object at 0x7f56abbd0dd0>
     6
    <_ast.Div object at 0x7f56abbd0e50>
    42
   <_ast.Add object at 0x7f56abbd0cd0>
   <_ast.BinOp object at 0x7f56abb57dd0>
    3
    <_ast.Sub object at 0x7f56abbd0d50>
    1

As @user2357112 wrote in the comments : ast.parse interprets Python syntax, not mathematical expressions. (1+2)(3+4) would be parsed as a function call and list comprehensions would be accepted even though they probably shouldn't be considered a valid mathematical expression.

List with a regex

If you want a flat structure, a regex could work :

import re

number_or_symbol = re.compile('(\d+|[^ 0-9])')
print(re.findall(number_or_symbol, source))
# ['(', '(', '81', '*', '6', ')', '/', '42', '+', '(', '3', '-', '1', ')', ')']

It looks for either :

multiple digits
or any character which isn't a digit or a space

Once you have a list of elements, you could check if the syntax is correct, for example with a stack to check if parentheses are matching, or if every element is a known one.

Simson answered 13/4, 2017 at 10:31 Comment(2)

Watch out - ast.parse parses Python syntax, not mathematical expressions. For example, (1+2)(3+4) gets parsed as a function call, 3^4 is treated as XOR, and [x**2 for x in range(5)] is treated as a perfectly ordinary input instead of nonsense that should be rejected. – Haimes 13/4, 2017 at 19:42

I've seen calculators which were pretty picky about leaving * out with parens but I cannot remember any concrete example. AFAIK, ^ also isn't used by every single calculator. Still, you make a very good point. I tried to integrate it into the answer. Thanks – Simson 13/4, 2017 at 21:48

You need to implement a very simple tokenizer for your input. You have the following types of tokens:

(
)
+
-
*
/
\d+

You can find them in your input string separated by all sorts of white space.

So a first step is to process the string from start to finish, and extract these tokens, and then do your parsing on the tokens, rather than on the string itself.

A nifty way to do this is to use the following regular expression: '\s*([()+*/-]|\d+)'. You can then:

import re

the_input='(3+(2*5))'
tokens = []
tokenizer = re.compile(r'\s*([()+*/-]|\d+)')
current_pos = 0
while current_pos < len(the_input):
  match = tokenizer.match(the_input, current_pos)
  if match is None:
     raise Error('Syntax error')
  tokens.append(match.group(1))
  current_pos = match.end()
print(tokens)

This will print ['(', '3', '+', '(', '2', '*', '5', ')', ')']

You could also use re.findall or re.finditer, but then you'd be skipping non-matches, which are syntax errors in this case.

Sidestroke answered 13/4, 2017 at 10:35 Comment(2)

@EricDuminil oops. I should have used = instead of += when updating current_pos, as match.end() is relative to the whole string, not current_pos. I've also added an example. – Sidestroke 13/4, 2017 at 11:3

It works fine now. input shouldn't be used as a variable name, though. – Simson 13/4, 2017 at 11:55

If you don't want to use re module, you can try this:

s="((81 * 6) /42+ (3-1))"

r=[""]

for i in s.replace(" ",""):
    if i.isdigit() and r[-1].isdigit():
        r[-1]=r[-1]+i
    else:
        r.append(i)
print(r[1:])

Output:

['(', '(', '81', '*', '6', ')', '/', '42', '+', '(', '3', '-', '1', ')', ')']

Dug answered 13/4, 2017 at 11:6 Comment(1)

@EricDuminil Thanks for your reminding, I fixed the index error by adding a empty string. – Dug 13/4, 2017 at 11:56

It actual would be pretty trivial to hand-roll a simple expression tokenizer. And I'd think you'd learn more that way as well.

So for the sake of education and learning, Here is a trivial expression tokenizer implementation which can be extended. It works based upon the "maximal-much" rule. This means it acts "greedy", trying to consume as many characters as it can to construct each token.

Without further ado, here is the tokenizer:

class ExpressionTokenizer:
    def __init__(self, expression, operators):
        self.buffer = expression
        self.pos = 0
        self.operators = operators

    def _next_token(self):
        atom = self._get_atom()

        while atom and atom.isspace():
            self._skip_whitespace()
            atom = self._get_atom()

        if atom is None:
            return None
        elif atom.isdigit():
            return self._tokenize_number()
        elif atom in self.operators:
            return self._tokenize_operator()
        else:
            raise SyntaxError()

    def _skip_whitespace(self):
        while self._get_atom():
            if self._get_atom().isspace():
                self.pos += 1
            else:
                break

    def _tokenize_number(self):
        endpos = self.pos + 1
        while self._get_atom(endpos) and self._get_atom(endpos).isdigit():
            endpos += 1
        number = self.buffer[self.pos:endpos]
        self.pos = endpos
        return number

    def _tokenize_operator(self):
        operator = self.buffer[self.pos]
        self.pos += 1
        return operator

    def _get_atom(self, pos=None):
        pos = pos or self.pos
        try:
            return self.buffer[pos]
        except IndexError:
            return None

    def tokenize(self):
        while True:
            token = self._next_token()
            if token is None:
                break
            else:
                yield token

Here is a demo the usage:

tokenizer = ExpressionTokenizer('((81 * 6) /42+ (3-1))', {'+', '-', '*', '/', '(', ')'})
for token in tokenizer.tokenize():
    print(token)

Which produces the output:

(
(
81
*
6
)
/
42
+
(
3
-
1
)
)

Tashia answered 13/4, 2017 at 16:5 Comment(0)

Quick regex answer: re.findall(r"\d+|[()+\-*\/]", str_in)

Demonstration:

>>> import re
>>> str_in = "((81 * 6) /42+ (3-1))"
>>> re.findall(r"\d+|[()+\-*\/]", str_in)
['(', '(', '81', '*', '6', ')', '/', '42', '+', '(', '3', '-', '1', 
')', ')']

For the nested parentheses part, you can use a stack to keep track of the level.

Immateriality answered 13/4, 2017 at 10:35 Comment(0)

This does not provide quite the result you want but might be of interest to others who view this question. It makes use of the pyparsing library.

# Stolen from http://pyparsing.wikispaces.com/file/view/simpleArith.py/30268305/simpleArith.py
# Copyright 2006, by Paul McGuire
# ... and slightly altered

from pyparsing import *

integer = Word(nums).setParseAction(lambda t:int(t[0]))
variable = Word(alphas,exact=1)
operand = integer | variable

expop = Literal('^')
signop = oneOf('+ -')
multop = oneOf('* /')
plusop = oneOf('+ -')
factop = Literal('!')

expr = operatorPrecedence( operand,
    [("!", 1, opAssoc.LEFT),
     ("^", 2, opAssoc.RIGHT),
     (signop, 1, opAssoc.RIGHT),
     (multop, 2, opAssoc.LEFT),
     (plusop, 2, opAssoc.LEFT),]
    )

print (expr.parseString('((81 * 6) /42+ (3-1))'))

Output:

[[[[81, '*', 6], '/', 42], '+', [3, '-', 1]]]

Arabesque answered 13/4, 2017 at 17:5 Comment(1)

Pyparsing is no longer hosted on wikispaces.com. Go to github.com/pyparsing/pyparsing – Biron 27/8, 2018 at 13:1

Using grako:

start = expr $;
expr = calc | value;
calc = value operator value;
value = integer | "(" @:expr ")" ;
operator = "+" | "-" | "*" | "/";
integer = /\d+/;

grako transpiles to python.

For this example, the return value looks like this:

['73', '+', ['34', '-', '72', '/', ['33', '-', '3']], '+', ['56', '+', ['95', '-', '28']]]

Normally you'd use the generated semantics class as a template for further processing.

Algin answered 19/4, 2017 at 15:4 Comment(0)

To provide a more verbose regex approach that you could easily extend:

import re

solution = []
pattern = re.compile('([\d\.]+)')

s = '((73 + ( (34- 72 ) / ( 33 -3) )) + (56 +(95 - 28) ) )'

for token in re.split(pattern, s):
    token = token.strip()
    if re.match(pattern, token):
        solution.append(float(token))
        continue
    for character in re.sub(' ', '', token):
        solution.append(character)

Which will give you the result:

 solution = ['(', '(', 73, '+', '(', '(', 34, '-', 72, ')', '/', '(', 33, '-', 3, ')', ')', ')', '+', '(', 56, '+', '(', 95, '-', 28, ')', ')', ')']

Prothalamium answered 14/4, 2017 at 4:48 Comment(0)

Similar to @McGrady's answer, you can do this with a basic queue implementation. As a very basic implementation, here's what your Queue class can look like:

class Queue:

    EMPTY_QUEUE_ERR_MSG = "Cannot do this operation on an empty queue."

    def __init__(self):
        self._items = []

    def __len__(self) -> int:
        return len(self._items)

    def is_empty(self) -> bool:
        return len(self) == 0

    def enqueue(self, item):
        self._items.append(item)

    def dequeue(self):
        try:
            return self._items.pop(0)
        except IndexError:
            raise RuntimeError(Queue.EMPTY_QUEUE_ERR_MSG)

    def peek(self):
        try:
            return self._items[0]
        except IndexError:
            raise RuntimeError(Queue.EMPTY_QUEUE_ERR_MSG)

Using this simple class, you can implement your parse function as:

def tokenize_with_queue(exp: str) -> List:
    queue = Queue()
    cum_digit = ""
    for c in exp.replace(" ", ""):
        if c in ["(", ")", "+", "-", "/", "*"]:
            if cum_digit != "":
                queue.enqueue(cum_digit)
                cum_digit = ""
            queue.enqueue(c)
        elif c.isdigit():
            cum_digit += c
        else:
            raise ValueError
    if cum_digit != "": #one last sweep in case there are any digits waiting
        queue.enqueue(cum_digit)
    return [queue.dequeue() for i in range(len(queue))]

Testing it like below:

exp = "((73 + ( (34- 72 ) / ( 33 -3) )) + (56 +(95 - 28) ) )"
print(tokenize_with_queue(exp)")

would give you the token list as:

['(', '(', '73', '+', '(', '(', '34', '-', '72', ')', '/', '(', '33', '-', '3', ')', ')', ')', '+', '(', '56', '+', '(', '95', '-', '28', ')', ')', ')']

Vandusen answered 14/4, 2020 at 1:52 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Tree with ast

List with a regex

Recommended topics

Hot tags

Tree with `ast`