Pandas DataFrame groupby describe ~8x slower than computing separately

The following code summarizes numeric data using two different approaches.

The first approach uses DataFrame.describe() with some specific extra percentiles passed in.

The second approach separately computes the summary stats (mean, std, N), stacks the result, computes the same quantiles, then appends the two and sorts by the index, so the result is essentially the same as the first approach.

There are some minor naming differences that we can clean up afterwards, and since the summarized data is small, that is very fast.

It turns out that using the describe function was about 8x slower in this example.

I am looking for reasons why, and perhaps for suggestions on other approaches that may speed this up even more. The filters, groups, and values are all passed in from a UI to a Tornado service, so speed is important because the user is waiting for results, and the data can be even larger than this example.

    import pandas as pd
    import numpy as np
    from datetime import datetime

    def make_data (n):
        
        ts = datetime.now().timestamp() + abs(np.random.normal(60, 30, n)).cumsum()
        
        df = pd.DataFrame({
            'c1': np.random.choice(list('ABCDEFGH'), n),
            'c2': np.random.choice(list('ABCDEFGH'), n),
            'c3': np.random.choice(list('ABCDEFGH'), n),
            't1': np.random.randint(1, 20, n),
            't2': pd.to_datetime(ts, unit='s'),
            'x1': np.random.randn(n),
            'x2': np.random.randn(n),
            'x3': np.random.randn(n)
            })
        
        return df
    
    def summarize_numeric_1 (df, mask, groups, values, quantiles): 
        
        dfg = df[mask].groupby(groups)[values]
        
        return dfg.describe(percentiles = quantiles).stack()
    
    def summarize_numeric_2 (df, mask, groups, values, quantiles): 

        dfg = df[mask].groupby(groups)[values]

        # compute mean/std/count separately from the quantiles, then combine
        dfg_stats = dfg.agg([np.mean, np.std, len]).stack()
        dfg_quantiles = dfg.quantile(quantiles)

        return dfg_stats.append(dfg_quantiles).sort_index()

    %time df = make_data(1000000)
    
    groups = ['c1', 'c2', 't1']
    mask = df['c3'].eq('H') & df['c1'].eq('A')
    values = ['x1', 'x3']
    base_quantiles = [0, .5, 1] 
    extd_quantiles = [0.25, 0.75, 0.9]
    all_quantiles = base_quantiles + extd_quantiles
    
    %timeit summarize_numeric_1(df, mask, groups, values, extd_quantiles)
    %timeit summarize_numeric_2(df, mask, groups, values, all_quantiles)

The timings on my PC for this are:

Using describe: 873 ms ± 8.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using two step method: 105 ms ± 490 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

All inputs welcome!

Divertimento answered 24/6, 2018 at 22:10 Comment(6)
I have the same problem. It is even slower than 8x compared to doing them separately one by one. I strongly suspect describe was implemented as just working logic, not optimized for runtime. – Equilibrist
Tried it just now and got similar results. Still no update on that. – Woken
Since .describe() includes .count() but your two-step method does not, we might be seeing the same problem as reported in stackoverflow.com/questions/63314312 – Janes
The two-step method included the count by way of the len function. I also just tried using 'count' in the list of functions passed to agg(), and the describe-based method was still 13.9x slower than separately computing the same quantiles and group stats. This is with pandas 1.1.0. – Divertimento
count is different from len, I believe. len will give you the total number of rows; count returns the number of non-null values. – Subtractive
Yup, len and count are indeed different. Either way, though, it was still much faster to summarize the data separately than with groupby().describe() – Divertimento
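
For reference, a minimal sketch (with made-up values, not from the thread) of the len vs. count distinction discussed in these comments:

    import numpy as np
    import pandas as pd

    s = pd.Series([1.0, np.nan, 3.0])
    print(len(s))      # 3 -> total number of rows, NaN included
    print(s.count())   # 2 -> only non-null values are counted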

Educated guess

I will post this as an answer, maybe to be deleted later, since it is more of an educated guess than an actual answer. Also it is a bit too long for a comment.

So the first thing I did after reading your question was to re-run your timings in a profiler to take a closer look at the issue. Since the time of the computation itself was rather short, it got overshadowed quite a bit by the data generation. However, in general the times were similar to what you described. Not only that, the difference got even more pronounced:
1094 ms for the first approach vs 63 ms for the second one, a factor of ~17 difference.

Since the lower of the two times was rather small, I decided it was too small to be trusted and re-ran the test with a 10x larger generated data sample. That pushed the data generation step up to a minute, and the numbers got weird: 1173 ms for the first approach vs 506 ms for the second one, a factor only slightly worse than two.

I started to suspect something. To confirm my suspicion, I ran one last test, once again increasing the data size by another factor of 10. The result might surprise you:
12258 ms for the first approach vs 3646 ms for the second one. The tables have turned completely, the factor being ~0.3.

My guess in this case is that the pandas computation is actually the one with the better optimization/algorithm. However, since it is pandas, it carries quite a bit of additional baggage - the price paid for convenience and robustness. This means there is a layer of "unnecessary" (computation-wise) baggage that needs to be carried around no matter how large the data set is.

So if you want to be faster than pandas even on data sets of your size, take their operations and write them yourself - in the most straightforward manner possible. That keeps their optimization and discards the baggage paid for convenience.
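
As a rough illustration of that idea (not part of the original answer), here is a hedged sketch that loops over the groups once and computes the raw statistics with plain NumPy. The function name and the NaN handling are assumptions, and whether it actually beats the two-step pandas version will depend on how many groups there are:

    import numpy as np
    import pandas as pd

    def summarize_numeric_numpy(df, mask, groups, values, quantiles):
        # Hypothetical hand-rolled summary: one pass over the groups,
        # plain NumPy for the per-group statistics.
        rows = {}
        for key, g in df[mask].groupby(groups):
            for col in values:
                a = g[col].to_numpy()
                a = a[~np.isnan(a)]          # mimic pandas: ignore NaNs
                stats = {'count': a.size,
                         'mean': a.mean() if a.size else np.nan,
                         'std': a.std(ddof=1) if a.size > 1 else np.nan}
                for q in quantiles:
                    stats[q] = np.quantile(a, q) if a.size else np.nan
                rows[key + (col,)] = stats   # key is a tuple of group labels
        return pd.DataFrame(rows).T

    # usage with the question's variables:
    # summarize_numeric_numpy(df, mask, groups, values, all_quantiles)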

Infliction answered 23/11, 2020 at 14:11 Comment(0)

Note: this answer is for pandas version 1.0.5. Things might be different for other versions.

tl;dr

The pandas describe() method will always be slower than your version because, under the hood, it uses almost exactly the same logic, plus some other things like making sure the data has the right dimensions, ordering the results, and checking for NaNs and the correct data types.


Longer Answer

Taking a look at the source code of the describe method, we can see a few things:

  • pandas uses the same logic as your code to compute the statistics. See this line inside the describe() method for an example of how it's using the same logic. This means that pandas describe will always be slower.
  • pandas counts only non-NaN values using s.count(), but your code counts all values with len. Let's modify your code to use that same counting method instead of len():
def summarize_numeric_3(df, mask, groups, values, quantiles): 
    # same as summarize_numeric_2, but count only non-null values like describe() does
    dfg = df[mask].groupby(groups)[values]
    dfg_stats = dfg.agg([np.mean, np.std, pd.Series.count]).stack()
    dfg_quantiles = dfg.quantile(quantiles)
    return dfg_stats.append(dfg_quantiles).sort_index()

%timeit -n 10 summarize_numeric_3(df, mask, groups, values, all_quantiles)

# outputs
# 48.9 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

It takes ~49 ms, vs 42 ms for your version on my machine. That is 7 more ms just for this relatively small modification!

  • pandas does a lot more than your code to ensure the data has the correct type and shape, and to present it in a pretty format. I've extracted the pandas describe method code into a "self-contained"* version that you can profile and tinker with here (too long to be posted in this answer). Profiling that, I see that a very large portion of the time is taken to "set a convenient order for rows". Removing that ordering improved the describe timing by ~8%, from 530 ms down to 489 ms.
Cid answered 29/11, 2020 at 17:53 Comment(0)
