Pandas groupby: efficiently chain several functions
I need to group a DataFrame and apply several chained functions on each group.

My problem is basically the same as in pandas - Groupby two functions: apply cumsum then shift on each group.

There are answers there on how to obtain a correct result; however, they seem to have suboptimal performance. My specific question is thus: is there a more efficient way than the ones I describe below?


First, here is some large test data:

from string import ascii_lowercase

import numpy as np
import pandas as pd


n = 100_000_000
np.random.seed(0)
df = pd.DataFrame(
    {
        "x": np.random.choice(np.array([*ascii_lowercase]), size=n),
        "y": np.random.normal(size=n),
    }
)

Below is the performance of each function:

%timeit df.groupby("x")["y"].cumsum()
4.65 s ± 71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.groupby("x")["y"].shift()
5.29 s ± 54.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

A basic solution is to group twice. This seems suboptimal, since grouping is a large part of the total runtime and should only need to be done once.

%timeit df.groupby("x")["y"].cumsum().groupby(df["x"]).shift()
10.1 s ± 63.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The accepted answer to the aforementioned question suggests using apply with a custom function to avoid this issue. However, for some reason it actually performs much worse than the previous solution.

def cumsum_shift(s):
    return s.cumsum().shift()

%timeit df.groupby("x")["y"].apply(cumsum_shift)
27.8 s ± 858 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Do you have any idea how to optimize this code? Especially in a case where I'd like to chain more than two functions, performance gains can become quite significant.

Tartlet answered 2/2, 2022 at 15:57 · Comments (10)
I'm not really sure there's a way to do this, at least without first sorting the df by x. Also, apply is known to be slow. (Matrass)
Indeed, sorting by x decreases the runtime of apply to ~8 s. Sorting itself takes a lot of time, but it's still good to know for some cases. (Kofu)
Do you mind having it sorted? It may be possible to combine the cumsum and shift together if you don't mind. (Matrass)
Does pipe help speed-wise with the second option? (Reposition)
@richardec If you have a solution that involves sorting, please post it. It may help in some cases. (Kofu)
@Reposition What do you mean exactly? If you can suggest explicit code, I can try and benchmark it. (Kofu)
Based on your example, a cumsum is not necessary; a shift is all you need from my preliminary test. The pipe idea doesn't fit here. Do you mind adding a small example with expected output, to see if there is a way to avoid grouping twice? It's easier to reason about before scaling up. (Reposition)
@RoméoDesprés You can speed up the code using NumPy + Numba, but be aware that in the end the code will be significantly more complex. If your strings are fairly long (e.g. > 8 characters), then a manual sort is certainly the best option. Operating on strings is known to be slow compared to categorical data (though the conversion is generally expensive). Pandas is slow here mainly because of strings and sorting (groupby sorts internally). (Itinerancy)
Did you ever find a way to preserve the grouping? I have yet to find a way that does not involve two groupby operations. (Stoa)
@Stoa No, still an open problem! (Kofu)
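Following up on the sorting idea raised in the comments, here is a hedged NumPy sketch (my own, not from the thread; `grouped_cumsum_shift` is a hypothetical helper name): sort once by the key so each group is contiguous, then compute both the per-group cumsum and the per-group shift with vectorized operations, avoiding a second groupby entirely.

```python
import numpy as np
import pandas as pd


def grouped_cumsum_shift(df):
    # Sort once (stable) so that each group's rows are contiguous.
    order = np.argsort(df["x"].to_numpy(), kind="stable")
    x = df["x"].to_numpy()[order]
    y = df["y"].to_numpy()[order]

    # Mark the first row of each contiguous group.
    starts = np.empty(len(x), dtype=bool)
    starts[0] = True
    starts[1:] = x[1:] != x[:-1]

    # Global cumsum, then subtract each group's starting offset
    # so the cumsum restarts at every group boundary.
    cs = np.cumsum(y)
    start_idx = np.flatnonzero(starts)
    group_ids = np.cumsum(starts) - 1       # group number for every row
    offsets = cs[start_idx] - y[start_idx]  # total of all preceding groups
    cs = cs - offsets[group_ids]

    # Shift within each group: values move down one row, NaN at group starts.
    shifted = np.empty_like(cs)
    shifted[1:] = cs[:-1]
    shifted[starts] = np.nan

    # Undo the sort to restore the caller's row order.
    out = np.empty_like(shifted)
    out[order] = shifted
    return pd.Series(out, index=df.index)


small = pd.DataFrame({"x": ["b", "a", "b", "a"], "y": [1.0, 2.0, 3.0, 4.0]})
print(grouped_cumsum_shift(small).tolist())  # [nan, nan, 1.0, 2.0]
```

This is only a sketch for the single-key, float-valued case (it assumes a non-empty frame); whether it beats two groupbys at n = 100M would need benchmarking on real data.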
Your reference case (slightly trimmed) gives me:

%%timeit
df.groupby("x").y.cumsum().groupby(df["x"]).shift()
3.7 s ± 198 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Suggestions:

You can front-load much of pandas' string-parsing work by converting your grouping columns to either a MultiIndex (if you want to group on multiple columns) or a CategoricalIndex (for a single grouping column):

%%timeit
idx = pd.CategoricalIndex(df["x"])
df.y.groupby(idx).cumsum().groupby(idx).shift()
2.32 s ± 23.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
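As a quick sanity check (a tiny hypothetical frame, not part of the original benchmark; the names `fast` and `slow` are mine), the CategoricalIndex route produces the same values as the plain double groupby:

```python
import pandas as pd

df = pd.DataFrame({"x": list("babab"), "y": [1.0, 2.0, 3.0, 4.0, 5.0]})

idx = pd.CategoricalIndex(df["x"])
fast = df.y.groupby(idx).cumsum().groupby(idx).shift()
slow = df.groupby("x")["y"].cumsum().groupby(df["x"]).shift()

print(fast.tolist())  # [nan, nan, 1.0, 2.0, 4.0]
print(slow.tolist())  # [nan, nan, 1.0, 2.0, 4.0]
```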
Cropeared answered 22/5, 2023 at 18:33 · Comments (1)
Good one! It's still suboptimal compared to what I was looking for (grouping once and chaining functions), but that's definitely a good improvement. (Kofu)
I would suggest giving transform a try instead of apply.

Try this:

%timeit df.groupby("x")["y"].transform(np.cumsum).transform(lambda x: x.shift())

or, also try using

from toolz import pipe

%timeit df.groupby("x").pipe(lambda g: g["y"].cumsum().shift())

I am pretty sure that pipe can be more efficient than apply or transform. Let us know if it works well.

Pironi answered 3/1, 2023 at 22:08 · Comments (1)
Pipe does not preserve the grouping, so this will shift without considering the 'x' groups. (Stoa)
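The point in the comment above can be checked on a tiny hypothetical frame (the names `piped` and `grouped` are mine): pipe hands the lambda the GroupBy object, but the final .shift() runs on the already-concatenated Series, so values leak across group boundaries:

```python
import pandas as pd

df = pd.DataFrame({"x": ["a", "a", "b", "b"], "y": [1.0, 2.0, 3.0, 4.0]})

piped = df.groupby("x").pipe(lambda g: g["y"].cumsum().shift())
grouped = df.groupby("x")["y"].cumsum().groupby(df["x"]).shift()

print(piped.tolist())    # [nan, 1.0, 3.0, 3.0] -- row 2 leaks a value from group 'a'
print(grouped.tolist())  # [nan, 1.0, nan, 3.0] -- shift restarts in each group
```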
Let me know if this helps; a few weeks back I was having the same issue.

I solved it by just splitting the code and creating a separate groupby object, which contains information about the groups.

# creating groupby object
g = df.groupby('x')['y']

%timeit g.cumsum()
592 ms ± 8.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit g.shift()
1.7 s ± 8.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Retral answered 18/10, 2022 at 9:50 · Comments (3)
Thanks for your reply. Correct me if I'm wrong, but it doesn't seem like you are chaining calls here, simply running one then the other. By chaining, I mean "apply cumsum and then apply shift on the result of cumsum". (Kofu)
Is g.cumsum().groupby(df["x"]).shift() considered chaining? Because here I'm applying cumsum then shift. It takes 7.78 s ± 36.8 ms per loop, and without a separate groupby object it takes 13.7 s ± 1.27 s per loop on my machine. (Retral)
It's actually df.groupby("x")["y"].cumsum().groupby(df["x"]).shift() (the shift should be applied to each group individually, otherwise values from one group would shift into the next group). (Kofu)
