The following code summarizes numeric data using two different approaches.
The first approach uses the Dataframe().describe() and passes some specific extra percentiles.
The second approach separately computes the summary stats (mean, std, N), stacks it, computes the same quantiles, then appends the two and sorts by the index so the result is essentially the same as the first approach.
There are some minor naming differences that we can clean up afterword's and since the summarized data is small, that is very fast.
Turns out that using the describe function was about 8x slower in this example.
I am looking for reasons why and perhaps suggestions on any other approaches that may speed this up even more (filters, groups, values) are all passed in from UI to a tornado service - so speed is important, as the user is waiting for results, and the data can be even larger that this example.
import pandas as pd
import numpy as np
from datetime import datetime
def make_data (n):
ts = datetime.now().timestamp() + abs(np.random.normal(60, 30, n)).cumsum()
df = pd.DataFrame({
'c1': np.random.choice(list('ABCDEFGH'), n),
'c2': np.random.choice(list('ABCDEFGH'), n),
'c3': np.random.choice(list('ABCDEFGH'), n),
't1': np.random.randint(1, 20, n),
't2': pd.to_datetime(ts, unit='s'),
'x1': np.random.randn(n),
'x2': np.random.randn(n),
'x3': np.random.randn(n)
})
return df
def summarize_numeric_1 (df, mask, groups, values, quantiles):
dfg = df[mask].groupby(groups)[values]
return dfg.describe(percentiles = quantiles).stack()
def summarize_numeric_2 (df, filt, groups, values, quantiles):
dfg = df[mask].groupby(groups)[values]
dfg_stats = dfg.agg([np.mean, np.std, len]).stack()
dfg_quantiles = dfg.quantile(all_quantiles)
return dfg_stats.append(dfg_quantiles).sort_index()
%time df = make_data(1000000)
groups = ['c1', 'c2', 't1']
mask = df['c3'].eq('H') & df['c1'].eq('A')
values = ['x1', 'x3']
base_quantiles = [0, .5, 1]
extd_quantiles = [0.25, 0.75, 0.9]
all_quantiles = base_quantiles + extd_quantiles
%timeit summarize_numeric_1(df, mask, groups, values, extd_quantiles)
%timeit summarize_numeric_2(df, mask, groups, values, all_quantiles)
The timings on my PC for this are:
Using describe: 873 ms ± 8.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Using two step method: 105 ms ± 490 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
All inputs welcome!
count
is different thanlen
, I believe.len
will give you the total number of rows.count
returns the number of non-null values, – Subtractive