When should I (not) want to use pandas apply() in my code?

I have seen many answers posted to questions on Stack Overflow involving the use of the Pandas method apply. I have also seen users commenting under them saying that "apply is slow, and should be avoided".

I have read many articles on the topic of performance that explain apply is slow. I have also seen a disclaimer in the docs about how apply is simply a convenience function for passing UDFs (can't seem to find that now). So, the general consensus is that apply should be avoided if possible. However, this raises the following questions:

  1. If apply is so bad, then why is it in the API?
  2. How and when should I make my code apply-free?
  3. Are there ever any situations where apply is good (better than other possible solutions)?
Heterosexuality answered 30/1, 2019 at 2:34 Comment(3)
returns.add(1).apply(np.log) vs. np.log(returns.add(1)) is a case where apply will generally be marginally faster, which is the bottom right green box in jpp's diagram below.Goulash
@Goulash thanks. Did not exhaustively point out these situations, but they are useful to know!Heterosexuality
Apply is fast enough and a great API 80% of the time. So I heartily disagree with the sentiments that suggest not to use it. But it's definitely good to be aware of its limitations and have some of the tricks outlined in the top answer in your back pocket, in case indeed apply ends up being too slow.Omen

apply, the Convenience Function you Never Needed

We start by addressing the questions in the OP, one by one.

"If apply is so bad, then why is it in the API?"

DataFrame.apply and Series.apply are convenience functions defined on the DataFrame and Series objects respectively. apply accepts any user-defined function that applies a transformation/aggregation on a DataFrame. apply is effectively a catch-all that does whatever no existing pandas function can do.

Some of the things apply can do:

  • Run any user-defined function on a DataFrame or Series
  • Apply a function either row-wise (axis=1) or column-wise (axis=0) on a DataFrame
  • Perform index alignment while applying the function
  • Perform aggregation with user-defined functions (however, we usually prefer agg or transform in these cases)
  • Perform element-wise transformations
  • Broadcast aggregated results to original rows (see the result_type argument).
  • Accept positional/keyword arguments to pass to the user-defined functions.

...Among others. For more information, see Row or Column-wise Function Application in the documentation.
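
To make a few of these features concrete, here is a minimal sketch (the data and the weighted_sum function are made-up illustrations) showing axis, args, and result_type in action:

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# Column-wise (axis=0): each column is passed to the function as a Series.
col_ranges = df.apply(lambda col: col.max() - col.min())

# Row-wise (axis=1), with extra positional arguments forwarded via `args`.
def weighted_sum(row, w_a, w_b):
    return row['A'] * w_a + row['B'] * w_b

weights = df.apply(weighted_sum, axis=1, args=(0.1, 0.9))

# Broadcast a row-wise aggregate back to the original shape
# (result_type only takes effect with axis=1).
row_means = df.apply(lambda row: row.mean(), axis=1, result_type='broadcast')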

So, with all these features, why is apply bad? It is because apply is slow. Pandas makes no assumptions about the nature of your function, so it iteratively applies your function to each row/column as necessary. Additionally, handling all of the situations above means apply incurs major overhead at each iteration. Further, apply consumes a lot more memory, which is a challenge for memory-bound applications.

There are very few situations where apply is appropriate to use (more on that below). If you're not sure whether you should be using apply, you probably shouldn't.



pandas 2.2 update: apply now supports engine='numba'

More info in the release notes as well as GH54666

Choose between the python (default) engine or the numba engine in apply.

The numba engine will attempt to JIT compile the passed function, which may result in speedups for large DataFrames. It also supports the following engine_kwargs:

  • nopython (compile the function in nopython mode)
  • nogil (release the GIL inside the JIT compiled function)
  • parallel (try to apply the function in parallel over the DataFrame)

Note: Due to limitations within numba/how pandas interfaces with numba, you should only use this if raw=True
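
As a rough sketch of what that looks like (assuming pandas >= 2.2 with numba installed; the data and function here are placeholders, and exact support may vary by version):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((10**6, 4)), columns=list('ABCD'))

# raw=True passes plain ndarrays to the function, which is what the numba
# engine needs; the first call pays a one-off JIT compilation cost.
result = df.apply(
    lambda arr: arr.max() - arr.min(),
    axis=1,
    raw=True,
    engine='numba',
    engine_kwargs={'nopython': True},
)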


Let's address the next question.

"How and when should I make my code apply-free?"

To rephrase, here are some common situations where you will want to get rid of any calls to apply.

Numeric Data

If you're working with numeric data, there is likely already a vectorized cython function that does exactly what you're trying to do (if not, please either ask a question on Stack Overflow or open a feature request on GitHub).

Contrast the performance of apply for a simple addition operation.

df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})
df

   A   B
0  9  12
1  4   7
2  2   5
3  1   4


df.apply(np.sum)

A    16
B    28
dtype: int64

df.sum()

A    16
B    28
dtype: int64

Performance-wise, there's no comparison: the cythonized equivalent is much faster. There's no need for a graph, because the difference is obvious even for toy data.

%timeit df.apply(np.sum)
%timeit df.sum()
2.22 ms ± 41.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
471 µs ± 8.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Even if you enable passing raw arrays with the raw argument, it's still twice as slow.

%timeit df.apply(np.sum, raw=True)
840 µs ± 691 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Another example:

df.apply(lambda x: x.max() - x.min())

A    8
B    8
dtype: int64

df.max() - df.min()

A    8
B    8
dtype: int64

%timeit df.apply(lambda x: x.max() - x.min())
%timeit df.max() - df.min()

2.43 ms ± 450 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.23 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In general, seek out vectorized alternatives if possible.
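
One more pattern worth knowing (a hedged sketch, not part of the benchmarks above): conditional row-wise logic written with apply usually has a direct vectorized counterpart.

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})

# apply version: a Python-level conditional evaluated once per row
df['C'] = df.apply(lambda row: row['A'] if row['A'] > row['B'] else row['B'], axis=1)

# vectorized equivalents
df['C'] = df['A'].where(df['A'] > df['B'], df['B'])
df['C'] = np.where(df['A'] > df['B'], df['A'], df['B'])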


String/Regex

Pandas provides "vectorized" string functions in most situations, but there are rare cases where those functions do not... "apply", so to speak.

A common problem is to check whether a value in a column is present in another column of the same row.

df = pd.DataFrame({
    'Name': ['mickey', 'donald', 'minnie'],
    'Title': ['wonderland', "welcome to donald's castle", 'Minnie mouse clubhouse'],
    'Value': [20, 10, 86]})
df

     Name                       Title  Value
0  mickey                  wonderland     20
1  donald  welcome to donald's castle     10
2  minnie      Minnie mouse clubhouse     86

This should return the second and third rows, since "donald" and "minnie" are present in the "Title" column of their respective rows.

Using apply, this would be done as follows:

df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)

0    False
1     True
2     True
dtype: bool
 
df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]

     Name                       Title  Value
1  donald  welcome to donald's castle     10
2  minnie      Minnie mouse clubhouse     86

However, a better solution exists using list comprehensions.

df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]

     Name                       Title  Value
1  donald  welcome to donald's castle     10
2  minnie      Minnie mouse clubhouse     86


%timeit df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]
%timeit df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]

2.85 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
788 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The thing to note here is that iterative routines happen to be faster than apply, because of the lower overhead. If you need to handle NaNs and invalid dtypes, you can build on this using a custom function you can then call with arguments inside the list comprehension.
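
For example, a minimal sketch of such a helper (the function name and the NaN-handling policy are illustrative assumptions):

def name_in_title(title, name):
    # Guard against NaN and other non-string values before the membership test.
    try:
        return name.lower() in title.lower()
    except AttributeError:
        return False

df[[name_in_title(x, y) for x, y in zip(df['Title'], df['Name'])]]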

For more information on when list comprehensions should be considered a good option, see my writeup: Are for-loops in pandas really bad? When should I care?.

Note
Date and datetime operations also have vectorized versions. So, for example, you should prefer pd.to_datetime(df['date']) over, say, df['date'].apply(pd.to_datetime).

Read more at the docs.


A Common Pitfall: Exploding Columns of Lists

s = pd.Series([[1, 2]] * 3)
s

0    [1, 2]
1    [1, 2]
2    [1, 2]
dtype: object

People are tempted to use apply(pd.Series). This is horrible in terms of performance.

s.apply(pd.Series)

   0  1
0  1  2
1  1  2
2  1  2

A better option is to listify the column and pass it to pd.DataFrame.

pd.DataFrame(s.tolist())

   0  1
0  1  2
1  1  2
2  1  2


%timeit s.apply(pd.Series)
%timeit pd.DataFrame(s.tolist())

2.65 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
816 µs ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
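
The same idea carries over when the list column lives inside a larger DataFrame; a small sketch (column names made up) that keeps the original index so the expanded columns align:

df2 = pd.DataFrame({'key': ['a', 'b', 'c'], 'vals': [[1, 2]] * 3})

expanded = pd.DataFrame(df2['vals'].tolist(), index=df2.index)
out = df2.drop(columns='vals').join(expanded)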


Lastly,

"Are there any situations where apply is good?"

Apply is a convenience function, so there are situations where the overhead is negligible enough to forgive. It really depends on how many times the function is called.

Functions that are Vectorized for Series, but not DataFrames
What if you want to apply a string operation on multiple columns? What if you want to convert multiple columns to datetime? These functions are vectorized for Series only, so they must be applied over each column that you want to convert/operate on.

df = pd.DataFrame(
         pd.date_range('2018-12-31','2019-01-31', freq='2D').date.astype(str).reshape(-1, 2), 
         columns=['date1', 'date2'])
df

       date1      date2
0 2018-12-31 2019-01-02
1 2019-01-04 2019-01-06
2 2019-01-08 2019-01-10
3 2019-01-12 2019-01-14
4 2019-01-16 2019-01-18
5 2019-01-20 2019-01-22
6 2019-01-24 2019-01-26
7 2019-01-28 2019-01-30

df.dtypes

date1    object
date2    object
dtype: object
    

This is an admissible case for apply:

df.apply(pd.to_datetime, errors='coerce').dtypes

date1    datetime64[ns]
date2    datetime64[ns]
dtype: object

Note that it would also make sense to stack, or just use an explicit loop. All these options are slightly faster than using apply, but the difference is small enough to forgive.

%timeit df.apply(pd.to_datetime, errors='coerce')
%timeit pd.to_datetime(df.stack(), errors='coerce').unstack()
%timeit pd.concat([pd.to_datetime(df[c], errors='coerce') for c in df], axis=1)
%timeit for c in df.columns: df[c] = pd.to_datetime(df[c], errors='coerce')

5.49 ms ± 247 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.94 ms ± 48.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.16 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.41 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

You can make a similar case for other operations such as string operations, or conversion to category.

u = df.apply(lambda x: x.str.contains(...))
v = df.apply(lambda x: x.astype('category'))

versus

u = pd.concat([df[c].str.contains(...) for c in df], axis=1)
v = df.copy()
for c in df:
    v[c] = df[c].astype('category')

And so on...


Converting Series to str: astype versus apply

This seems like an idiosyncrasy of the API. Using apply to convert integers in a Series to string is comparable to (and sometimes faster than) using astype.

[graph image] The graph was plotted using the perfplot library.

import perfplot

perfplot.show(
    setup=lambda n: pd.Series(np.random.randint(0, n, n)),
    kernels=[
        lambda s: s.astype(str),
        lambda s: s.apply(str)
    ],
    labels=['astype', 'apply'],
    n_range=[2**k for k in range(1, 20)],
    xlabel='N',
    logx=True,
    logy=True,
    equality_check=lambda x, y: (x == y).all())

With floats, I see that astype is consistently as fast as, or slightly faster than, apply. So this has to do with the fact that the data in the test is of integer type.


GroupBy operations with chained transformations

GroupBy.apply has not been discussed until now, but GroupBy.apply is also an iterative convenience function to handle anything that the existing GroupBy functions do not.

One common requirement is to perform a GroupBy and then two chained operations such as a "lagged cumsum":

df = pd.DataFrame({"A": list('aabcccddee'), "B": [12, 7, 5, 4, 5, 4, 3, 2, 1, 10]})
df

   A   B
0  a  12
1  a   7
2  b   5
3  c   4
4  c   5
5  c   4
6  d   3
7  d   2
8  e   1
9  e  10


You'd need two successive groupby calls here:

df.groupby('A').B.cumsum().groupby(df.A).shift()
 
0     NaN
1    12.0
2     NaN
3     NaN
4     4.0
5     9.0
6     NaN
7     3.0
8     NaN
9     1.0
Name: B, dtype: float64

Using apply, you can shorten this to a single call.

df.groupby('A').B.apply(lambda x: x.cumsum().shift())

0     NaN
1    12.0
2     NaN
3     NaN
4     4.0
5     9.0
6     NaN
7     3.0
8     NaN
9     1.0
Name: B, dtype: float64

It is very hard to quantify the performance because it depends on the data. But in general, apply is an acceptable solution if the goal is to reduce a groupby call (because groupby is also quite expensive).



Other Caveats

Aside from the caveats mentioned above, it is also worth mentioning that apply operates on the first row (or column) twice. This is done to determine whether the function has any side effects. If not, apply may be able to use a fast-path for evaluating the result, else it falls back to a slow implementation.

df = pd.DataFrame({
    'A': [1, 2],
    'B': ['x', 'y']
})

def func(x):
    print(x['A'])
    return x

df.apply(func, axis=1)

# 1
# 1
# 2
   A  B
0  1  x
1  2  y

This behaviour is also seen in GroupBy.apply on pandas versions <0.25 (it was fixed for 0.25, see here for more information.)

Heterosexuality answered 30/1, 2019 at 2:34 Comment(8)
I think we need to be careful.. with %timeit for c in df.columns: df[c] = pd.to_datetime(df[c], errors='coerce') surely after the first iteration it'll be much quicker since you're converting datetime to ... datetime?Deianira
@Deianira I had the same concern. But you still need to do a linear scan either way, calling to_datetime on strings is as fast as calling them on datetime objects, if not faster. The ballpark timings are the same. The alternative would be to implement some pre-copy step for every timed solution which takes away from the main point. But it is a valid concern.Heterosexuality
"Calling to_datetime on strings is as fast as on ... datetime objects" .. really? I included dataframe creation (fixed cost) in apply vs for loop timings and the difference is much smaller.Deianira
@Deianira Well, that's what I got from my (admittedly limited) testing. I'm sure it depends on the data, but the general idea is that for the purpose of illustration, the difference is "seriously, don't worry about it".Heterosexuality
@cs95, I have found apply and list comprehensions are almost equally fast. Check this repo: github.com/tseth92/pandas_experiments/blob/master/…; they are much faster than iterrows and iterloops. I am not comparing with cythonized vectors, but if it's a particular customized function, like in the repo, should I consider list comps over apply? If yes, then why, since they seem to be almost equally fast?Epicotyl
I think another answer to "Are there any situations where apply is good?" is illustrated by this very answer. Notice that in general, the solutions not using apply are significantly more complex -and thus error prone- compared to just not thinking about it and using apply. Thus as in software development -and in general- life, you probably want to apply the 80-20 rule. 80% of the time using apply is preferred. But in the 20% of the time that the result is too slow, you can go ahead and optimize away from apply.Omen
Thank for the detailed post! Is it still valid though? I can't seem to find the same timings as you in the comparison between df.apply(np.sum) vs df.sum() with pandas 1.3.2: df.sum() is only ~20% faster than df.apply(np.sum) and df.apply(np.sum, raw=True) is twice as fast! I couldn't find performance improvements specific to apply in the changelog so I'm a bit lost...Brainwash
@Brainwash this might need to be revised a bit as implementation details are prone to change timeits without prior noticeHeterosexuality

Not all applys are alike

The chart below suggests when to consider apply1. Green means possibly efficient; red means avoid.

[flowchart image]

Some of this is intuitive: pd.Series.apply is a Python-level row-wise loop, ditto pd.DataFrame.apply row-wise (axis=1). The misuses of these are many and wide-ranging. The other post deals with them in more depth. Popular solutions are to use vectorised methods, list comprehensions (assumes clean data), or efficient tools such as the pd.DataFrame constructor (e.g. to avoid apply(pd.Series)).

If you are using pd.DataFrame.apply row-wise, specifying raw=True (where possible) is often beneficial. At this stage, numba is usually a better choice.

GroupBy.apply: generally favoured

Repeating groupby operations to avoid apply will hurt performance. GroupBy.apply is usually fine here, provided the methods you use in your custom function are themselves vectorised. Sometimes there is no native Pandas method for a groupwise aggregation you wish to apply. In this case, for a small number of groups apply with a custom function may still offer reasonable performance.
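
As a sketch of that last case (the weighted-mean aggregation and column names are illustrative assumptions, not from the original answer):

import pandas as pd

df = pd.DataFrame({'grp': ['a', 'a', 'b', 'b'],
                   'val': [1.0, 2.0, 3.0, 4.0],
                   'wgt': [0.2, 0.8, 0.5, 0.5]})

# A groupwise weighted mean: there is no single built-in GroupBy reduction
# for this, so apply with a custom function is a reasonable fit.
df.groupby('grp')[['val', 'wgt']].apply(
    lambda g: (g['val'] * g['wgt']).sum() / g['wgt'].sum())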

pd.DataFrame.apply column-wise: a mixed bag

pd.DataFrame.apply column-wise (axis=0) is an interesting case. For a small number of rows versus a large number of columns, it's almost always expensive. For a large number of rows relative to columns, the more common case, you may sometimes see significant performance improvements using apply:

# Python 3.7, Pandas 0.23.4
np.random.seed(0)
df = pd.DataFrame(np.random.random((10**7, 3)))     # Scenario_1, many rows
df = pd.DataFrame(np.random.random((10**4, 10**3))) # Scenario_2, many columns

                                               # Scenario_1  | Scenario_2
%timeit df.sum()                               # 800 ms      | 109 ms
%timeit df.apply(pd.Series.sum)                # 568 ms      | 325 ms

%timeit df.max() - df.min()                    # 1.63 s      | 314 ms
%timeit df.apply(lambda x: x.max() - x.min())  # 838 ms      | 473 ms

%timeit df.mean()                              # 108 ms      | 94.4 ms
%timeit df.apply(pd.Series.mean)               # 276 ms      | 233 ms

1 There are exceptions, but these are usually marginal or uncommon. A couple of examples:

  1. df['col'].apply(str) may slightly outperform df['col'].astype(str).
  2. df.apply(pd.to_datetime) working on strings doesn't scale well with rows versus a regular for loop.
Deianira answered 30/1, 2019 at 4:53 Comment(6)
@coldspeed, Thanks, there's nothing much wrong with your post (apart from some contradictory benchmarking vs mine, but could be input or setup based). Just felt there's a different way to look at the problem.Deianira
@Deianira I always used your excellent flowchart as guidance, until today when I saw that a row-wise apply is significantly faster than my solution with any. Any thoughts on this?Barri
@Stef, How many rows of data are you looking at? Construct a dataframe with 1mio+ rows and try comparing the logic, the apply should be slower. Also note the problem may be mask (try using np.where instead). A process which takes 3-5 milliseconds isn't good for benchmarking purposes, since in reality you probably don't care for performance when times are so small.Deianira
@jpp: you're right: for 1mio rows x 100 cols any is about 100 times faster than apply. I did my first tests with 2000 rows x 1000 cols and here apply was twice as fast as anyBarri
@Deianira I would like to use your image in a presentation / article. Are you okay with that? I will obviously mention the source. ThanksAshley
@Erfan, Sure, go ahead.Deianira

For axis=1 (i.e. row-wise functions), you can just use the following function in lieu of apply. I wonder why this isn't the pandas behavior. (Untested with compound indexes, but it does appear to be much faster than apply.)

def faster_df_apply(df, func):
    cols = list(df.columns)
    data, index = [], []
    for row in df.itertuples(index=True):
        # Build a plain dict for the row instead of a pandas Series;
        # per-row Series construction is the main source of apply's overhead.
        row_dict = {f: v for f, v in zip(cols, row[1:])}
        data.append(func(row_dict))
        index.append(row[0])
    return pd.Series(data, index=index)
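
A hypothetical usage example (the function you pass in receives a plain dict per row, not a Series):

df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})
result = faster_df_apply(df, lambda row: row["A"] + row["B"])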
Correlation answered 20/5, 2019 at 2:34 Comment(7)
I was very surprised to find this gave me better performance in some cases. It was especially useful when I needed to do multiple things, each with a different subset of column values. The "All applys aren't alike" answer might help figure out when it is likely to help but it is not super difficult to test on a sample of your data.Supposing
A few pointers: for performance a list comprehension would outperform the for loop; zip(df, row[1:]) is sufficient here; really, at this stage, consider numba if func is a numeric calculation. See this answer for an explanation.Deianira
@Deianira - if you have a better function please share. I think this is pretty close to optimal from my analysis. Yes numba is faster, faster_df_apply is meant for people that just want something equivalent to, but faster than, the DataFrame.apply (which is weirdly slow).Correlation
This is actually very close to how .apply is implemented, but it does one thing that significantly slows it down, it essentially does: row = pd.Series({f:v for f,v in zip(cols, row[1:])}) which adds a lot of drag. I wrote an answer that described the implementation, albeit, I think it's outdated, recent versions have tried to leverage Cython in .apply, I believe (don't quote me on that)Emogeneemollient
@Emogeneemollient that explains it perfectly! Thanks so much.Correlation
Why make a new index? And why not just call dict directly on zip instead of doing the comprehension? And why bother converting the columns to a list instead of iterating over the columns?Loyola
@DanielGibson - I can't make heads or tails of any of your questions. My point is df.apply(func, axis=1) will return the same thing as faster_df_apply(df, func), but it will run much faster on DataFrames with a great many rows. If you have a better solution, please share it. I think telling people "just don't call apply" (as other people have done above) is a silly non-solution. Some people really want to call apply, and faster_df_apply is an exact substitute that runs faster.Correlation

Are there ever any situations where apply is good? Yes, sometimes.

Task: decode Unicode strings.

import numpy as np
import pandas as pd
import unidecode

s = pd.Series(['mañana','Ceñía'])
s.head()
0    mañana
1     Ceñía


s.apply(unidecode.unidecode)
0    manana
1     Cenia

Update
I was by no means advocating for the use of apply, just thinking that since NumPy cannot deal with the above situation, it could have been a good candidate for pandas apply. But I was forgetting the plain ol' list comprehension, thanks to the reminder by @jpp.

Toxicogenic answered 23/2, 2019 at 16:11 Comment(2)
Well, no. How is this better than [unidecode.unidecode(x) for x in s] or list(map(unidecode.unidecode, s))?Deianira
Since it was already a pandas Series, I was tempted to use apply. Yeah, you are right, it's better to use a list comp than apply, but the downvote was a little harsh. I was not advocating for apply, just thought this could have been a good use case.Toxicogenic

I'd like to add another reason for not using apply that was not mentioned in other answers: apply can cause nasty mutation bugs, i.e. df.apply(func, axis=1) passes the same object to the given function for every row. Try this:

import pandas as pd

df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})
print(df.apply(id, axis=1))
# 0    2771319967472
# 1    2771319967472
# 2    2771319967472
# 3    2771319967472

(the id is the same for all rows)

The documentation for apply states:

Functions that mutate the passed object can produce unexpected behavior or errors and are not supported.

But even knowing that, and avoiding mutations, is not enough. Consider the following code:

from asyncio import run, gather, sleep
import pandas as pd


async def afunc(row):
    await sleep(1)
    print(row)


async def main():
    df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})
    coros = df.apply(afunc, axis=1)
    await gather(*coros)


run(main())

As you can see, we have not mutated any row in our afunc. But if you run the code, you'll notice that every coroutine ends up printing the last row, because they all captured the same reused row object...

Another example without async:

import pandas as pd

applied = []


def func(row):
    applied.append(row)


df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})
df.apply(func, axis=1)
print(pd.DataFrame(applied))
#    A  B
# 3  1  4
# 3  1  4
# 3  1  4
# 3  1  4
Gilberte answered 31/3 at 17:6 Comment(0)
