Pandas: How to make apply on dataframe faster?

Consider this pandas example, where I'm calculating column C using apply with a lambda function: if a certain condition on B is fulfilled, C is simply A; otherwise it is A multiplied by B and a float factor:

import pandas as pd
df = pd.DataFrame({'A':[1,2,3,4,5,6,7,8,9],'B':[9,8,7,6,5,4,3,2,1]})

df['C'] = df.apply(lambda x: x.A if x.B > 5 else 0.1*x.A*x.B, axis=1)

The expected result would be:

   A  B    C
0  1  9  1.0
1  2  8  2.0
2  3  7  3.0
3  4  6  4.0
4  5  5  2.5
5  6  4  2.4
6  7  3  2.1
7  8  2  1.6
8  9  1  0.9

The problem is that this code is slow and I need to do this operation on a dataframe with around 56 million rows.

The %timeit result of the lambda operation above is:

1000 loops, best of 3: 1.63 ms per loop

Judging from the calculation time, and also from the memory usage when doing this on my large dataframe, I presume this operation creates intermediary Series while doing the calculations.

I tried to formulate it in different ways, including using temporary columns, but every alternative solution I came up with was even slower.

Is there a way to get the result I need in a different, faster way, e.g. by using numpy?

Alum asked 11/1, 2017 at 10:14 Comment(1)
You should look into numpy.where. – Nanci

For performance, you might be better off working with the underlying NumPy array and using np.where -

import numpy as np

a = df.values # Assuming you have two columns A and B
df['C'] = np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])

Runtime test

def numpy_based(df):
    a = df.values # Assuming you have two columns A and B
    df['C'] = np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])

Timings -

In [271]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=['A','B'])

In [272]: %timeit numpy_based(df)
1000 loops, best of 3: 380 µs per loop

In [273]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=['A','B'])

In [274]: %timeit df['C'] = df.A.where(df.B.gt(5), df[['A', 'B']].prod(1).mul(.1))
100 loops, best of 3: 3.39 ms per loop

In [275]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=['A','B'])

In [276]: %timeit df['C'] = np.where(df['B'] > 5, df['A'], 0.1 * df['A'] * df['B'])
1000 loops, best of 3: 1.12 ms per loop

In [277]: df = pd.DataFrame(np.random.randint(0,9,(10000,2)),columns=['A','B'])

In [278]: %timeit df['C'] = np.where(df.B > 5, df.A, df.A.mul(df.B).mul(.1))
1000 loops, best of 3: 1.19 ms per loop

Closer look

Let's take a closer look at NumPy's number-crunching capability and bring pandas into the mix -

# Extract out as an array (it's a view, so not really expensive
#   .. as compared to the later computations themselves)

In [291]: a = df.values

In [296]: %timeit df.values
10000 loops, best of 3: 107 µs per loop

Case #1 : Work with NumPy array and use numpy.where :

In [292]: %timeit np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])
10000 loops, best of 3: 86.5 µs per loop

Again, assigning into a new column df['C'] would not be very expensive either -

In [300]: %timeit df['C'] = np.where(a[:,1]>5,a[:,0],0.1*a[:,0]*a[:,1])
1000 loops, best of 3: 323 µs per loop

Case #2 : Work with pandas dataframe and use its .where method (no NumPy)

In [293]: %timeit df.A.where(df.B.gt(5), df[['A', 'B']].prod(1).mul(.1))
100 loops, best of 3: 3.4 ms per loop

Case #3 : Work with pandas dataframe (no NumPy array), but use numpy.where with the * operator -

In [294]: %timeit np.where(df['B'] > 5, df['A'], 0.1 * df['A'] * df['B'])
1000 loops, best of 3: 764 µs per loop

Case #4 : Work with pandas dataframe again (no NumPy array), but use numpy.where with pandas .mul -

In [295]: %timeit np.where(df.B > 5, df.A, df.A.mul(df.B).mul(.1))
1000 loops, best of 3: 830 µs per loop
Grith answered 11/1, 2017 at 10:16 Comment(5)
You beat me to it, but numpy.where plays well with pandas Series, and I believe that my version is more readable ;) – Nanci
@Nanci The OP asked for faster. This brings everything into numpy, allowing for more efficiency. – Shoe
@Nanci NumPy plays well with number crunching, and surely with dataframes, as far as I know! ;) – Grith
I'm getting 1000 loops, best of 3: 255 µs per loop from this solution, thanks a lot for this. – Alum
I can hardly believe it: the operation on my 56-million-row dataframe took around 1 second. – Alum

Pure pandas, using pd.Series.where:

df['C'] = df.A.where(df.B.gt(5), df[['A', 'B']].prod(1).mul(.1))

   A  B    C
0  1  9  1.0
1  2  8  2.0
2  3  7  3.0
3  4  6  4.0
4  5  5  2.5
5  6  4  2.4
6  7  3  2.1
7  8  2  1.6
8  9  1  0.9
Shoe answered 11/1, 2017 at 10:20 Comment(3)
Nice, is prod faster than mul? – Ferocious
Maybe... barely. – Shoe
@Nanci Because it's pure pandas... slower. That's why we all raced for numpy. But Divakar beat us to it. I provided this answer because it's different: hardly anyone uses where from a pandas perspective. It's interesting because it keeps the existing values where the condition is True and takes the alternative otherwise. – Shoe

Pandas is a great tool for data manipulation, but it runs on a single CPU core by default. In addition, Pandas is built to run vectorized API functions on entire columns or datasets in one sweep, whereas apply runs custom user code. The other answers avoid the use of apply with custom code, but this may not be possible or practical in general. If processing large datasets with apply is a pain point for you, you should consider an acceleration and scaling solution such as Bodo. Bodo directly compiles your apply code to optimize it in ways that Pandas cannot. In addition to vectorizing your code, Bodo provides automatic parallelization. You can run your code on up to 4 cores with the Bodo community edition (which is free to use). Here is a link to the Bodo installation instructions: https://docs.bodo.ai/latest/source/installation_and_setup/install.html

I generated a similar dataset to yours but with 20 million rows and ran the code with regular Pandas on one core and with Bodo on 4 cores. With regular Pandas, it takes about 6.5 minutes to run your code while with Bodo’s community edition it takes around half a second.

# data generation
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1,10,size=(20000000, 2)), columns=list('AB'))
df.to_parquet("data.pq")

Regular Pandas:

import pandas as pd
import time

start = time.time()

df = pd.read_parquet("data.pq")
df['C'] = df.apply(lambda x: x.A if x.B > 5 else 0.1*x.A*x.B, axis=1)

end = time.time()
print("computation time: ", end - start)

print(df.head())

output:
computation time:  378.3832001686096
   A  B    C
0  3  5  1.5
1  8  6  8.0
2  1  7  1.0
3  8  1  0.8
4  4  8  4.0

With Bodo:

%%px

import pandas as pd
import time
import bodo

@bodo.jit(distributed=['df'])
def apply():
    start = time.time()
    df = pd.read_parquet("data.pq")
    df['C'] = df.apply(lambda x: x.A if x.B > 5 else 0.1*x.A*x.B, axis=1)
    end = time.time()
    print("computation time: ", end - start)
    print(df.head())
    return df
df = apply()

output:
[stdout:0] 
computation time:  0.3610380489999443
   A  B    C
0  3  5  1.5
1  8  6  8.0
2  1  7  1.0
3  8  1  0.8
4  4  8  4.0

Disclaimer: I work as a data scientist advocate at Bodo.ai.

Baccivorous answered 4/5, 2021 at 16:21 Comment(2)
I tried Bodo on my problem and got an error: DataFrame.loc[] getitem (location-based indexing) using Tuple(unicode_type, Literal[int](2020)) not supported yet. – Shawnee
Did not work: BodoError: Cannot call non-JIT function 'loads' from JIT function (convert to JIT or use objmode). – Cervicitis

Using numpy.where:

import numpy

df['C'] = numpy.where(df['B'] > 5, df['A'], 0.1 * df['A'] * df['B'])
Nanci answered 11/1, 2017 at 10:18 Comment(0)

Use:

import numpy as np

df['C'] = np.where(df.B > 5, df.A, df.A.mul(df.B).mul(.1))
print(df)
   A  B    C
0  1  9  1.0
1  2  8  2.0
2  3  7  3.0
3  4  6  4.0
4  5  5  2.5
5  6  4  2.4
6  7  3  2.1
7  8  2  1.6
8  9  1  0.9
Ferocious answered 11/1, 2017 at 10:18 Comment(6)
mul is not very different from *, is it? ;) – Nanci
I did some research, and it seems that for df.A*df.B versus df.A.mul(df.B), mul is faster. But when multiplying by a constant, it is the same. – Ferocious
@Nanci Also, it's convenient for chaining. – Shoe
Hmmm, here * is faster; maybe the reason is that np.where works with numpy arrays. – Ferocious
It was also a reference to this other time when jezrael beat me to an answer, and the only difference between us was that I used / where he had used div, which he pointed out made little difference. – Nanci
@Ferocious I was surprised too; I also expected mul to be faster (if not by much). – Nanci

pandas 2.2 update: apply now supports engine='numba'

More info in the release notes, as well as GH54666.

Choose between the python (default) engine or the numba engine in apply.

The numba engine will attempt to JIT compile the passed function, which may result in speedups for large DataFrames. It also supports the following engine_kwargs:

  • nopython (compile the function in nopython mode)
  • nogil (release the GIL inside the JIT compiled function)
  • parallel (try to apply the function in parallel over the DataFrame)

Note: Due to limitations within numba and in how pandas interfaces with it, you should only use this with raw=True.

This is useful for situations where a numpy alternative isn't obvious. Benchmarking your code on test data is always a useful way to determine whether this is working as expected.
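For illustration, here is a minimal sketch of how this could look on the question's example (assuming pandas >= 2.2 with numba installed). Note that with raw=True the lambda receives each row as a NumPy array, so columns are accessed by position rather than by label:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'B': [9, 8, 7, 6, 5, 4, 3, 2, 1]})

# raw=True is required for the numba engine: each row arrives as an ndarray,
# so A and B are addressed as row[0] and row[1] instead of x.A and x.B.
df['C'] = df.apply(
    lambda row: row[0] if row[1] > 5 else 0.1 * row[0] * row[1],
    axis=1,
    raw=True,
    engine='numba',
    engine_kwargs={'nopython': True},
)

The first call pays a one-off JIT compilation cost; the speedup shows on subsequent calls and on large DataFrames.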

Donor answered 26/2, 2024 at 11:3 Comment(0)
