How to parse and evaluate a math expression with Pandas Dataframe columns?

What I would like to do is parse an expression such as this one:

result = A + B + sqrt(B + 4)

Where A and B are columns of a dataframe. So I would have to translate the expression into something like this in order to get the result:

new_col = df.B + 4
result = df.A + df.B + new_col.apply(sqrt)

Where df is the dataframe.

I have tried re.sub, but it only works for replacing the column variables (not the functions), like this:

import re

def repl(match):
    inner_word = match.group(1)
    new_var = "df['{}']".format(inner_word)
    return new_var

eq = 'A + 3 / B'
new_eq = re.sub('([a-zA-Z_]+)', repl, eq)
result = eval(new_eq)

So, my questions are:

  • Is there a Python library to do this? If not, how can I achieve this in a simple way?
  • Could a recursive function be the solution?
  • Would using reverse Polish notation simplify the parsing?
  • Would I have to use the ast module?
Bruns answered 6/11, 2017 at 11:18 Comment(5)
Did you try result = df["A"] + df["B"] + sqrt(df["B"] + 4)? It should work. - Ode
@DimuthTharakaMenikgama read the full question, it's not only that one expression. - Ingest
Can you show your dataframe (at least a few rows)? - Ode
If I use the sqrt function as you say, I get this error: TypeError: cannot convert the series to <class 'float'>. So the function must be used with apply. - Bruns
The dataframe could have float64 values, int32 values, even numpy.nan values. - Bruns
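
The TypeError quoted in the comments is the error math.sqrt raises when it is handed a whole Series, since it expects a single float. A minimal sketch, assuming the sqrt in question came from the math module (the thread does not say) and using made-up sample data, which suggests that NumPy's vectorized np.sqrt can take the column directly and lets NaN pass through:

```python
import math

import numpy as np
import pandas as pd

# Illustrative frame with mixed dtypes and one missing value.
df = pd.DataFrame({"A": [1.0, 2.0, np.nan], "B": [5, 12, 21]})

# math.sqrt works on one scalar at a time, so a multi-row Series is rejected.
try:
    math.sqrt(df["B"] + 4)
    scalar_sqrt_failed = False
except TypeError:
    scalar_sqrt_failed = True

# np.sqrt is applied element-wise, so no apply() call is needed.
result = df["A"] + df["B"] + np.sqrt(df["B"] + 4)
print(result)
```

With this approach the expression from the question can be written almost literally, with only the column names and the sqrt function needing substitution.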

Pandas DataFrames have an eval method. Using your example equation:

import pandas as pd
# create an example DataFrame to work with
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
# define equation
eq = 'A + 3 / B'
# actual computation
df.eval(eq)

# more complicated equation
eq = "A + B + sqrt(B + 4)"
df.eval(eq)

Warning

Keep in mind that eval can run arbitrary code, which can make you vulnerable to code injection if you pass user input to this function.

Bagatelle answered 6/11, 2017 at 11:39 Comment(4)
Many thanks! It works fine. I would like to use other functions, but I have read this: "The support math functions are sin, cos, exp, log, expm1, log1p, sqrt, sinh, cosh, tanh, arcsin, arccos, arctan, arccosh, arcsinh, arctanh, abs and arctan2". So I am afraid I can only use those functions. Is it possible to add external functions to the expression? With the built-in Python eval() function it is possible to use the locals dictionary to pass the functions as objects, but I could not make it work with df.eval(). - Bruns
Well, I have written another question to handle this. - Bruns
Please add a caution to this. eval() allows arbitrary code to be run. This can be dangerous if eval is called on a string that is not sanitized! eval(): "This allows eval to run arbitrary code, which can make you vulnerable to code injection if you pass user input to this function." - Chickasaw
You are right, @Tom. I will add the warning to the answer. Thanks. - Bruns
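
For untrusted input, or for functions outside the supported list, one option raised in the question is the ast module combined with a small recursive function. The following is only a sketch under stated assumptions: safe_eval, BIN_OPS, UNARY_OPS, FUNCTIONS and the clip01 helper are hypothetical names invented for this illustration, not part of pandas, and the whitelist covers only basic arithmetic.

```python
import ast
import operator

import numpy as np
import pandas as pd

# Whitelists (hypothetical names): anything not listed here is rejected.
BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}
# clip01 stands in for any user-supplied "external" function.
FUNCTIONS = {"sqrt": np.sqrt, "log": np.log, "clip01": lambda s: s.clip(0, 1)}


def safe_eval(expr, df, functions=FUNCTIONS):
    """Evaluate an arithmetic expression over the columns of df."""

    def visit(node):
        # Numeric literals such as 4 or 0.5.
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        # Bare names must be column names.
        if isinstance(node, ast.Name):
            if node.id in df.columns:
                return df[node.id]
            raise NameError(f"unknown column: {node.id}")
        if isinstance(node, ast.BinOp) and type(node.op) in BIN_OPS:
            return BIN_OPS[type(node.op)](visit(node.left), visit(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in UNARY_OPS:
            return UNARY_OPS[type(node.op)](visit(node.operand))
        # Calls are allowed only to whitelisted functions, positional args only.
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in functions
            and not node.keywords
        ):
            return functions[node.func.id](*[visit(arg) for arg in node.args])
        raise ValueError(f"unsupported syntax: {ast.dump(node)}")

    return visit(ast.parse(expr, mode="eval").body)


df = pd.DataFrame({"A": [1, 2], "B": [5, 12]})
result = safe_eval("A + B + sqrt(B + 4)", df)
print(result)
```

Because every node type has to be opted in explicitly, an input such as "__import__('os')" is rejected with a ValueError instead of being executed. The trade-off is that this will likely be slower than df.eval on large frames and supports far less syntax.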

Following the example provided by @uuazed, a faster way would be to use numexpr:

import pandas as pd
import numpy as np
import numexpr as ne

df = pd.DataFrame(np.random.randn(int(1e6), 2), columns=['A', 'B'])
eq = "A + B + sqrt(B + 4)"
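# %timeit is an IPython magic; run these lines in IPython or Jupyter
%timeit df.eval(eq)
# 15.9 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# bind the columns to local names so numexpr can find A and B
%timeit A=df.A; B=df.B; ne.evaluate(eq)
# 6.24 ms ± 396 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)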

numexpr may also support more operations.

Fungus answered 6/11, 2017 at 12:17 Comment(3)
It is faster, but you need to know in advance which variables you are going to use. If not, you must do some parsing before the evaluation, and that takes time. - Bruns
If you check the eval documentation, the default engine is numexpr. - Bruns
Yes, good point! It is curious that the timings differ so much just from how the expression is evaluated. - Fungus
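
On the point about needing to know the variables in advance: one way around it, sketched here with made-up sample data, is to hand numexpr every column through the local_dict argument of ne.evaluate instead of binding A and B by hand. The columns dictionary is a name chosen for this illustration, and the sketch assumes the column names are valid Python identifiers.

```python
import numexpr as ne
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0], "B": [5.0, 12.0]})
eq = "A + B + sqrt(B + 4)"

# Expose every column by name; numexpr looks up only the names used in eq.
columns = {name: df[name].to_numpy() for name in df.columns}

# Wrap the resulting NumPy array back into a Series aligned with df.
result = pd.Series(ne.evaluate(eq, local_dict=columns), index=df.index)
print(result)
```

No parsing of the expression is needed up front, at the cost of building a dictionary that references each column.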
