Your data is classified into too many categories, which is the main reason that makes the groupby
code too slow. I tried using Bodo to see how it would do with the groupby
on a large data set. I ran the code with regular sequential Pandas and parallelized Bodo. It took about 20 seconds for Pandas and only 5 seconds for Bodo to run. Bodo basically parallelizes your Pandas code automatically and allows you to run it on multiple processors, which you cannot do with native pandas. It is free for up to four cores: https://docs.bodo.ai/latest/source/installation_and_setup/install.html
Notes on data generation: I generated a relatively large dataset with 20 million rows and 18 numerical columns. To make the generated data more resemblant to your dataset, two other columns named “index” and “qty_liter” are added.
#data generation
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(20000000, 18), columns = list('ABCDEFGHIJKLMNOPQR'))
df['index'] = np.random.randint(2147400000,2147500000,20000000).astype(str)
df['qty_liter'] = np.random.randn(20000000)
df.to_parquet("data.pq")
With Regular Pandas:
import time
import pandas as pd
import numpy as np
start = time.time()
df = pd.read_parquet("data.pq")
grouped = df.groupby(['index'])['qty_liter'].sum()
end = time.time()
print("computation time: ", end - start)
print(grouped.head())
output:
computation time: 19.29292106628418
index
2147400000 29.701094
2147400001 -7.164031
2147400002 -21.104117
2147400003 7.315127
2147400004 -12.661605
Name: qty_liter, dtype: float64
With Bodo:
%%px
import numpy as np
import pandas as pd
import time
import bodo
@bodo.jit(distributed = ['df'])
def group_by():
start = time.time()
df = pd.read_parquet("data.pq")
df = df.groupby(['index'])['qty_liter'].sum()
end = time.time()
print("computation time: ", end - start)
print(df.head())
return df
df = group_by()
output:
[stdout:0]
computation time: 5.12944599299226
index
2147437531 6.975570
2147456463 1.729212
2147447371 26.358158
2147407055 -6.885663
2147454784 -5.721883
Name: qty_liter, dtype: float64
Disclaimer: I am a data scientist advocate working at Bodo.ai
awk 'BEGIN{FS=OFS=","}{a[$1]+=$2}END{ for (i in a) print i,a[i]}'
– Inveighqty_liter
? – Hurwitzqty_liter
by the consumers' ID. the I checked the values ofqty_liter
, all of them are positive and float type. – Lir