I have a dataframe with
- 5 million rows,
- a column group_id with 500,000 unique values,
- thousands of other columns named var1, var2, etc., each containing only 0 and 1.

I would like to group by group_id and sum each var column within each group. For better performance I use dask, but the speed is still slow for this simple aggregation.
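To make the aggregation concrete, here is a toy example with made-up values:

import pandas as pd

toy = pd.DataFrame({
    'group_id': [1, 1, 2, 2, 2],
    'var1':     [0, 1, 1, 1, 0],
    'var2':     [1, 1, 0, 1, 0],
})

# One row per group_id, each var column summed within the group.
print(toy.groupby('group_id').sum())
#           var1  var2
# group_id
# 1            1     2
# 2            2     1

With the benchmark code shown at the bottom of this question, I measured: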
The time spent on a dataframe with 10 columns is 6.285385847091675 seconds
The time spent on a dataframe with 100 columns is 64.9060411453247 seconds
The time spent on a dataframe with 200 columns is 150.6109869480133 seconds
The time spent on a dataframe with 300 columns is 235.77087807655334 seconds
My real dataset contains up to 30,000 columns. I have read two answers (1 and 2) by @Divakar about using numpy; however, the former thread is about counting and the latter is about summing columns.
Could you please elaborate on some ways to speed up this aggregation?
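For reference, this is roughly the numpy-style grouped sum I took away from those answers. It is my own adaptation, not taken verbatim from them; in particular, mapping group_id to contiguous integers with np.unique and looping over columns with np.bincount are my assumptions:

import numpy as np
import pandas as pd

def groupby_sum_numpy(df, group_col='group_id'):
    # Map arbitrary group ids to contiguous integers 0..n_groups-1.
    ids, inverse = np.unique(df[group_col].to_numpy(), return_inverse=True)
    var_cols = [c for c in df.columns if c != group_col]
    out = np.empty((ids.size, len(var_cols)), dtype=np.int64)
    for j, col in enumerate(var_cols):
        # Each column holds only 0 and 1, so the per-group sum is a weighted bincount.
        out[:, j] = np.bincount(inverse, weights=df[col].to_numpy(),
                                minlength=ids.size)
    return pd.DataFrame(out, index=pd.Index(ids, name=group_col), columns=var_cols)

I am not sure this scales to 30,000 columns, which is part of what I am asking. Here is the code I use to generate the test data and time the dask aggregation: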
import os
import time
from multiprocessing import dummy

import numpy as np
import pandas as pd
import dask.dataframe as dd

core = os.cpu_count()
P = dummy.Pool(processes=core)  # thread pool, used only to build the columns faster

n_docs = 500000        # number of distinct group_id values
n_rows = n_docs * 10   # 5 million rows
data = {}

def create_col(i):
    # Each varN column contains only 0 and 1.
    name = 'var' + str(i)
    data[name] = np.random.randint(0, 2, n_rows)

n_cols = 300
P.map(create_col, range(1, n_cols + 1))

df = pd.DataFrame(data, dtype='int8')
df.insert(0, 'group_id', np.random.randint(1, n_docs + 1, n_rows))
df = dd.from_pandas(df, npartitions=3 * core)

start = time.time()
df.groupby('group_id').sum().compute()
end = time.time()
print('The time spent on a dataframe with {} columns is'.format(n_cols), end - start, 'seconds')