Pandas: df.groupby() is too slow for big data set. Any alternative methods?

I have a pandas.DataFrame with 3.8 million rows and one column, and I'm trying to group it by index.

The index is the customer ID. I want to group the qty_liter by the index:

df = df.groupby(df.index).sum()

But it takes forever to finish the computation. Are there any alternative ways to deal with a very large data set?

Here is the df.info():

<class 'pandas.core.frame.DataFrame'>
Index: 3842595 entries, -2147153165 to \N
Data columns (total 1 columns):
qty_liter    object
dtypes: object(1)
memory usage: 58.6+ MB

The data looks like this:

[screenshot of the data]

Lir answered 22/6, 2017 at 16:5 Comment(5)
Save out the index as the first column, and then this one-liner in the terminal will achieve what you want: awk 'BEGIN{FS=OFS=","}{a[$1]+=$2}END{ for (i in a) print i,a[i]}'Inveigh
How many unique groups do you have? Even with 3.8 million unique indices, it computes the sum in less than a second (I tried with floats).Heteromerous
Do you care about the index info in the output dataframe? Could there be any negative value in qty_liter?Hurwitz
Yes, the index is the consumers' ID number, so I want to group qty_liter by the consumers' ID. I checked the values of qty_liter; all of them are positive floats.Lir
The problem is that neither the index nor the column values are numeric - I'd look into how you created this frame.Quarterly

The problem is that your data are not numeric. Processing strings takes a lot longer than processing numbers. Try this first:

df.index = df.index.astype(int)
df.qty_liter = df.qty_liter.astype(float)

Then do groupby() again. It should be much faster. If it is, see if you can modify your data loading step to have the proper dtypes from the beginning.
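
For example, a minimal sketch of the loading step, assuming the data comes from a CSV file (the file name and the customer_id column name are made up here; the \N placeholder is the one visible in the df.info() output above, and "Int64" is pandas' nullable integer dtype):

import pandas as pd

# read the id as a nullable integer and qty_liter as float straight away,
# and treat the "\N" placeholder seen in df.info() as a missing value
df = pd.read_csv(
    "sales.csv",
    dtype={"customer_id": "Int64", "qty_liter": "float64"},
    na_values=[r"\N"],
)

# rows whose customer_id is missing are skipped by groupby by default
result = df.groupby("customer_id")["qty_liter"].sum()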

Hoyle answered 16/9, 2017 at 0:37 Comment(2)
What about categoricals, does this improve speed?Poultryman
Categoricals are fine, so long as the number of categories is not huge.Hoyle

Your data has a very large number of groups, which is the main reason the groupby is so slow. I tried Bodo to see how it would handle a groupby on a large data set. I ran the code with regular sequential Pandas and with parallelized Bodo: it took about 20 seconds for Pandas and only 5 seconds for Bodo. Bodo basically parallelizes your Pandas code automatically and lets you run it on multiple processors, which you cannot do with native pandas. It is free for up to four cores: https://docs.bodo.ai/latest/source/installation_and_setup/install.html

Notes on data generation: I generated a relatively large dataset with 20 million rows and 18 numerical columns. To make the generated data more similar to your dataset, I added two more columns named "index" and "qty_liter".

# data generation

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(20000000, 18), columns = list('ABCDEFGHIJKLMNOPQR'))
df['index'] = np.random.randint(2147400000,2147500000,20000000).astype(str)
df['qty_liter'] = np.random.randn(20000000)

df.to_parquet("data.pq")

With Regular Pandas:

import time
import pandas as pd
import numpy as np

start = time.time()
df = pd.read_parquet("data.pq")
grouped = df.groupby(['index'])['qty_liter'].sum()
end = time.time()
print("computation time: ", end - start)
print(grouped.head())

output:
computation time:  19.29292106628418
index
2147400000    29.701094
2147400001    -7.164031
2147400002   -21.104117
2147400003     7.315127
2147400004   -12.661605
Name: qty_liter, dtype: float64

With Bodo:

%%px

import numpy as np
import pandas as pd
import time
import bodo

@bodo.jit(distributed = ['df'])
def group_by():
    start = time.time()
    df = pd.read_parquet("data.pq")
    df = df.groupby(['index'])['qty_liter'].sum()
    end = time.time()
    print("computation time: ", end - start)
    print(df.head())
    return df
    
df = group_by()

output:
[stdout:0] 
computation time:  5.12944599299226
index
2147437531     6.975570
2147456463     1.729212
2147447371    26.358158
2147407055    -6.885663
2147454784    -5.721883
Name: qty_liter, dtype: float64

Disclaimer: I am a data scientist advocate working at Bodo.ai

Bowrah answered 14/5, 2021 at 13:28 Comment(2)
The link to Bodo is brokenNonrigid
fixed now! thanks for letting me know!Bowrah

I do not use strings but integer values to define the groups. Still, it is very slow: about 3 minutes vs. a fraction of a second in Stata. The number of observations is about 113k; the number of groups defined by x, y, z is about 26k.

a= df.groupby(["x", "y", "z"])["b"].describe()[['max']]

x,y,z: integer values

b: real value
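
As a side note (my addition, not from the original post): describe() computes count, mean, std, min, the quartiles, and max for every group, so if only the max is needed, aggregating it directly avoids most of that work. A self-contained sketch with made-up data of roughly the size described:

import numpy as np
import pandas as pd

# synthetic example: ~113k rows, integer keys x, y, z (~27k possible groups)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.integers(0, 30, 113_000),
    "y": rng.integers(0, 30, 113_000),
    "z": rng.integers(0, 30, 113_000),
    "b": rng.normal(size=113_000),
})

# aggregate only the max instead of the full describe() output
a = df.groupby(["x", "y", "z"])["b"].agg(["max"])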

Precipitate answered 31/1, 2021 at 15:42 Comment(1)
Is this an answer or another question?Poultryman

Use categorical data type if you can't convert values to numeric:

df = df.astype('category')

Then, when you do the groupby, set observed=True:

df = df.groupby(df.index, observed=True).sum()

From the documentation (observed: bool, default False):

This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
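
A minimal sketch of the two steps together, with made-up data (the column and index names are assumptions, not from the question):

import pandas as pd

# toy frame: string customer ids as the index, litres as values
df = pd.DataFrame(
    {"qty_liter": [1.5, 2.0, 0.5, 3.0]},
    index=pd.Index(["a17", "b23", "a17", "c09"], name="customer_id"),
)

# make the grouping key categorical, then group with observed=True so the
# result only contains categories that actually appear in the data
df.index = df.index.astype("category")
result = df.groupby(df.index, observed=True).sum()
print(result)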

Cowhide answered 4/8, 2022 at 19:28 Comment(0)
