How do you apply a function on a dataframe column using data from previous rows?

I have a DataFrame with three columns: nums, with some values to work with; b, which is always either 1 or 0; and result, which is currently zero everywhere except in the first row (because we must have an initial value to work with). The DataFrame looks like this:

   nums   b    result
0  20.0  1    20.0
1  22.0  0    0
2  30.0  1    0
3  29.1  1    0
4  20.0  0    0
...
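
For anyone who wants to reproduce this, the example frame above can be built like so (treating result as a float column is an assumption on my part):

import pandas as pd

# reconstruction of the example DataFrame shown above
df = pd.DataFrame({
    'nums': [20.0, 22.0, 30.0, 29.1, 20.0],
    'b': [1, 0, 1, 1, 0],
    'result': [20.0, 0.0, 0.0, 0.0, 0.0],  # only the first row has an initial value
})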

The Problem

I'd like to go over each row in the DataFrame, starting with the second row, do some calculation and store the result in the result column. Since I'm working with large files, I need this operation to be fast, which is why I want something like apply.

The calculation I want to do takes the values of nums and result from the previous row. If the b column in the current row is 0, then I want (for example) to add the num and the result from that previous row; if b in that row is 1, I'd like (for example) to subtract them.

What have I tried?

I tried using apply, but I couldn't access the previous row, and sadly it seems that even if I did manage to access the previous row, the DataFrame wouldn't update the result column until the end.

I also tried using a loop like the one below, but it's too slow for the large files I'm working with:

for i in range(1, len(df.index)):
    row = df.index[i]
    new_row = df.index[i - 1]  # get index of previous row for "nums" and "result"
    df.loc[row, 'result'] = some_calc_func(prev_result=df.loc[new_row, 'result'],
                                           prev_num=df.loc[new_row, 'nums'],
                                           current_b=df.loc[row, 'b'])

some_calc_func looks like this (just a general example):

def some_calc_func(prev_result, prev_num, current_b):
    if current_b == 1:
        return prev_result * prev_num / 2
    else:
        return prev_num + 17

Please answer with respect to this some_calc_func.

Reticule answered 3/6, 2020 at 10:2 Comment(3)
"I need a way to make this operation fast so that's why I want something like apply" Caution: When should I ever want to use pandas apply() in my codeEmelia
It doesn't have to be apply, I just want something that does the described operation quicklyReticule
Understood, just wanted to let you know apply is not the first thing you should look for when you want speed.Emelia

If you want to keep the function some_calc_func and not use another library, you should not try to access each element individually at every iteration. Instead, you can use zip on the columns nums and b, with a shift between the two since you access nums from the previous row, and keep prev_res in memory at each iteration. Also, append to a list instead of the DataFrame, and assign the list to the column after the loop.

prev_res = df.loc[0, 'result']  # get the first result
l_res = [prev_res]  # initialize the list of results
# loop with zip to get both values at the same time;
# use loc to start b at the second row, but not nums
for prev_num, current_b in zip(df['nums'], df.loc[1:, 'b']):
    # use your function to calculate the new prev_res
    prev_res = some_calc_func(prev_res, prev_num, current_b)
    # add it to the list of results
    l_res.append(prev_res)
# assign to the column
df['result'] = l_res
print(df)  # same result as with your method
   nums  b  result
0  20.0  1    20.0
1  22.0  0    37.0
2  30.0  1   407.0
3  29.1  1  6105.0
4  20.0  0    46.1

Now with a dataframe df of 5000 rows, I got:

%%timeit
prev_res = df.loc[0, 'result']
l_res = [prev_res]
for prev_num, current_b in zip(df['nums'], df.loc[1:, 'b']):
    prev_res = some_calc_func(prev_res, prev_num, current_b)
    l_res.append(prev_res)
df['result'] = l_res
# 4.42 ms ± 695 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

and with your original solution, it was ~750x slower

%%timeit
for i in range(1, len(df.index)):
    row = df.index[i]
    new_row = df.index[i - 1]  # get index of previous row for "nums" and "result"
    df.loc[row, 'result'] = some_calc_func(prev_result=df.loc[new_row, 'result'],
                                           prev_num=df.loc[new_row, 'nums'],
                                           current_b=df.loc[row, 'b'])
# 3.25 s ± 392 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
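
The 5000-row test frame isn't shown; one plausible way to build it (an assumption, suggested by the comment below about "your data multiplied by 1000") is to tile the 5-row example and reset result except for the first row:

import pandas as pd

# hypothetical reconstruction of the benchmark frame:
# the 5-row example repeated 1000 times, with 'result' cleared except for row 0
df = pd.concat([df] * 1000, ignore_index=True)
df['result'] = 0.0
df.loc[0, 'result'] = 20.0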

EDIT: here is a version using another library, Numba, which works if the function some_calc_func can easily be used with a Numba decorator.

import numpy as np  # needed by with_numba below
from numba import jit
# decorate your function
@jit
def some_calc_func(prev_result, prev_num, current_b):
    if current_b == 1:
        return prev_result * prev_num / 2
    else:
        return prev_num + 17

# create a function to do your job
# numba likes numpy arrays
@jit
def with_numba(prev_res, arr_nums, arr_b):
    # array for results and initialize
    arr_res = np.zeros_like(arr_nums)
    arr_res[0] = prev_res
    # loop over the length of arr_b
    for i in range(len(arr_b)):
        # do the calculation and set the value in the result array
        prev_res = some_calc_func(prev_res, arr_nums[i], arr_b[i])
        arr_res[i + 1] = prev_res
    return arr_res

Finally, call it like this:

df['result'] = with_numba(df.loc[0, 'result'], 
                          df['nums'].to_numpy(),  
                          df.loc[1:, 'b'].to_numpy())

And with a timeit, I get another ~9x speed-up compared to my method with zip, and the speed-up could increase with the size of the data.

%timeit df['result'] = with_numba(df.loc[0, 'result'], 
                                  df['nums'].to_numpy(),  
                                  df.loc[1:, 'b'].to_numpy()) 
# 526 µs ± 45.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Note that using Numba might be problematic depending on your actual some_calc_func.
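
A quick way to check whether your real some_calc_func is Numba-friendly is to compile it in nopython mode with @njit (a minimal sketch, not part of the timings above): purely numeric code like the example compiles fine, while anything touching Python objects, strings or pandas raises a typing error at the first call instead of silently falling back to slow object mode.

from numba import njit  # njit is shorthand for jit(nopython=True)

@njit
def some_calc_func(prev_result, prev_num, current_b):
    # purely numeric branches compile in nopython mode;
    # unsupported operations would raise a TypingError when first called
    if current_b == 1:
        return prev_result * prev_num / 2
    else:
        return prev_num + 17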

Diarmit answered 13/6, 2020 at 21:9 Comment(6)
Hello! Somehow for me your methods were 1-3 seconds slower (the total time was ~600 seconds so it didn't make a big difference, but still). Do you have any other suggestions? - Reticule
@Reticule You mean the method with zip is slower than your original method? On your real problem with the real some_calc_func? Maybe share a bit more about your real problem, like what you mean by large and what your actual function is, because as you can see, on your data multiplied by 1000, zip is way faster. - Diarmit
The function is the same (the only difference is the constants), running on files with 200,000+ rows. I'll keep investigating as it does seem rather odd. - Reticule
Here, not even itertuples is faster than the zip way. - Mulloy
@MiguelAngelo Yeah, I would assume so. Can you say how much faster zip is compared to itertuples, if it is significant? - Diarmit
@Diarmit 3 times faster for the loop alone... With operations inside the loop, it will probably drop as the inner code grows to dominate the looping overhead. - Mulloy

IIUC:

>>> df['result'] = (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums
                    ).fillna(df.result).cumsum()

>>> df
   nums  b  result
0  20.0  1    20.0
1  22.0  0    42.0
2  30.0  1    12.0
3  29.1  1   -17.1
4  20.0  0     2.9

Explanation:

# replace 0 with 1 and 1 with -1 in column `b` for rows where result==0
>>> df[df.result.eq(0)].b.replace({0: 1, 1: -1})
1    1
2   -1
3   -1
4    1
Name: b, dtype: int64

# multiply with nums
>>> (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums)
0     NaN
1    22.0
2   -30.0
3   -29.1
4    20.0
dtype: float64

# fill the 'NaN' with the corresponding value from df.result (which is 20 here)
>>> (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums).fillna(df.result)
0    20.0
1    22.0
2   -30.0
3   -29.1
4    20.0
dtype: float64

# take the cumulative sum (cumsum)
>>> (df[df.result.eq(0)].b.replace({0: 1, 1: -1}) * df.nums).fillna(df.result).cumsum()
0    20.0
1    42.0
2    12.0
3   -17.1
4     2.9
dtype: float64

Going by your requirement in the comments, I cannot think of a way without loops:

c1, c2 = 2, 1
l = [df.loc[0, 'result']]            # store the first result in a list

# then loop over the series (df.b * df.nums)

for i, val in (df.b * df.nums).iteritems():
    if i:                            # except for 0th index
        if val == 0:                 # (df.b * df.nums) == 0 if df.b == 0
            l.append(l[-1])          # append the last result
        else:                        # otherwise apply the rule
            t = l[-1] * c2 + val * c1
            l.append(t)

>>> l
[20.0, 20.0, 80.0, 138.2, 138.2]

>>> df['result'] = l

   nums  b  result
0  20.0  1    20.0
1  22.0  0    20.0
2  30.0  1    80.0   # [ 20 * 1 +   30 * 2]
3  29.1  1   138.2   # [ 80 * 1 + 29.1 * 2]
4  20.0  0   138.2

Seems fast enough; I did not test it on a large sample.

Emelia answered 3/6, 2020 at 10:23 Comment(4)
Sure, just a moment. - Emelia
Much appreciated! I will test it with my version of the function and not the addition/subtraction :) - Reticule
Apologies for the late answer. I tested this several times and can't recreate what you did if I also want to multiply the values in the rows by some constant value. Let c1, c2 be two constants. How should I write the code if, in the case of b==0, I want to set the new row in result to the result from the previous row, and in the case of b==1 to set the new row in result to (c1 times nums from the same row) + (c2 times result from the previous row)? - Reticule
Could not think of a way without a loop. - Emelia

You have a function f(...) to apply, but you cannot, because you need to keep a memory of the previous row. You can do this either with a closure or a class. Below is a class implementation:

import pandas as pd

class Func():

    def __init__(self, value):
        self._prev = value
        self._init = True

    def __call__(self, x):
        if self._init:
            res = self._prev
            self._init = False
        elif x.b == 0:
            res = x.nums - self._prev
        else:
            res = x.nums + self._prev

        self._prev = res
        return res

#df = pd.read_clipboard()
f = Func(20)
df['result'] = df.apply(f, axis=1)

You can replace the body of __call__ with whatever you have in the body of some_calc_func.
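
If you prefer the closure route, a minimal sketch wired to the question's some_calc_func could look like the following (the helper name make_stepper is illustrative; note it also has to remember the previous nums value, since some_calc_func needs it):

def make_stepper(first_result):
    # mutable state shared by the inner function across apply calls
    state = {'prev_result': first_result, 'prev_num': None}

    def step(row):
        if state['prev_num'] is None:  # first row: keep the initial result
            res = state['prev_result']
        else:
            res = some_calc_func(prev_result=state['prev_result'],
                                 prev_num=state['prev_num'],
                                 current_b=row['b'])
        state['prev_result'] = res
        state['prev_num'] = row['nums']
        return res

    return step

df['result'] = df.apply(make_stepper(20.0), axis=1)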

Hotchkiss answered 16/6, 2020 at 18:6 Comment(0)

I realize this is what @Prodipta's answer was getting at, but this approach instead uses the global keyword to remember the previous result at each iteration of apply:

prev_result = 20

def my_calc(row):
    global prev_result
    i = int(row.name)   #the index of the current row
    if i==0:
        return prev_result   
    elif row['b'] == 1:
        out = prev_result * df.loc[i-1,'nums']/2   #loc to get prev_num
    else:
        out = df.loc[i-1,'nums'] + 17
    prev_result = out
    return out

df['result'] = df.apply(my_calc, axis=1)

Result for your example data:

   nums  b  result
0  20.0  1    20.0
1  22.0  0    37.0
2  30.0  1   407.0
3  29.1  1  6105.0
4  20.0  0    46.1

And here's a speed test a la @Ben T's answer - not the best but not the worst?

In[0]
df = pd.DataFrame({'nums':np.random.randint(0,100,5000),'b':np.random.choice([0,1],5000)})

prev_result = 20

%%timeit
df['result'] = df.apply(my_calc, axis=1)

Out[0]
117 ms ± 5.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Blasphemous answered 19/6, 2020 at 6:42 Comment(0)

re-using your loop and some_calc_func

I am using your loop and have reduced it to a bare minimum, as below:

for i in range(1, len(df)):
    df.loc[i, 'result'] = some_calc_func(df.loc[i, 'b'], df.loc[i - 1, 'result'], df.loc[i, 'nums'])

and some_calc_func is implemented as below:

def some_calc_func(bval, prev_result, curr_num):
    if bval == 0:
        return prev_result + curr_num
    else:
        return prev_result - curr_num

The result is as below

   nums  b  result
0  20.0  1    20.0
1  22.0  0    42.0
2  30.0  1    12.0
3  29.1  1   -17.1
4  20.0  0     2.9
Miltonmilty answered 13/6, 2020 at 10:54 Comment(1)
Hello. First of all, I meant that some_calc_func would remain the same as I wrote it (though in your solution it doesn't make any difference, so I'm just pointing it out). Second, this is similar to my method, except that I'm using dates so I can't simply use i-1. - Reticule
