Pandas DataFrame: Operation per batch of rows

I have a pandas DataFrame df for which I want to compute some statistics per batch of rows.

For example, let's say that I have a batch_size = 200000.

For each batch of batch_size rows, I would like to get the number of unique values in the ID column of my DataFrame.

How can I do something like that?

Here is an example of what I want:

print(df)

>>
+-------+
|     ID|
+-------+
|      1|
|      1|
|      2|
|      2|
|      2|
|      3|
|      3|
|      3|
|      3|
+-------+

batch_size = 3

my_new_function(df,batch_size)

>>
For batch 1 (0 to 2):
2 unique values
1 appears 2 times
2 appears 1 time

For batch 2 (3 to 5):
2 unique values
2 appears 2 times
3 appears 1 time

For batch 3 (6 to 8):
1 unique value
3 appears 3 times

Note: The output can of course be a simple DataFrame.

Bark answered 20/5, 2019 at 9:29 Comment(3)
Create a df_batch and then try df_batch.groupby("ID").drop_duplicates().size() – Millymilman
No need to groupby ID here imo, you could use df_batch.drop_duplicates(subset=['ID']).size(). But this still doesn't answer the question: what do you mean by batch, is it randomly 200000 rows? – Longdrawnout
Please post a sample input df and the expected output for a smaller batch_size (batch_size=3, for example). – Kedge

See here for splitting the dataframe. After that I would do:

from collections import Counter

# Count how many times each ID occurs within the current batch.
Counter(batch_df['ID'].tolist())
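
For completeness, a minimal end-to-end sketch; it assumes the arange-based grouping trick as the splitting step (one of the approaches from the linked post):

import numpy as np
import pandas as pd
from collections import Counter

df = pd.DataFrame({'ID': [1, 1, 2, 2, 2, 3, 3, 3, 3]})
batch_size = 3

# Integer-divide each row's position by batch_size to get its batch number.
for batch_number, batch_df in df.groupby(np.arange(len(df)) // batch_size):
    counts = Counter(batch_df['ID'])
    print(f"For batch {batch_number + 1}: {len(counts)} unique values, {dict(counts)}")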
Reinhardt answered 20/5, 2019 at 9:44 Comment(0)

See this post for the splitting process; then you could do this to get the number of unique 'ID' values per batch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2, 2, 3, 3, 3, 3]})
batch_size = 3

result = []
# np.arange(len(df)) // batch_size assigns each row to a batch number.
for batch_number, batch_df in df.groupby(np.arange(len(df)) // batch_size):
    result.append(batch_df['ID'].nunique())
pd.DataFrame(result)
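
If you also want the per-value counts from the question's expected output, one possible extension (a sketch reusing df, batch_size and np from above, with pandas' value_counts):

for batch_number, batch_df in df.groupby(np.arange(len(df)) // batch_size):
    counts = batch_df['ID'].value_counts()
    print(f"For batch {batch_number + 1}:")
    print(f"{counts.size} unique value(s)")
    for value, count in counts.items():
        print(f"{value} appears {count} time(s)")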

Edit: go with user3426270's answer, I didn't notice it when I answered.

Longdrawnout answered 20/5, 2019 at 9:56 Comment(0)

groupby with a custom aggregation function might solve your problem:

import pandas as pd
import numpy as np

df = pd.DataFrame({'ID': [1, 1, 2, 2, 2, 3, 3, 3, 3], 'X': 1})

batch_size = 3
# Cast to int: np.ceil returns a float, but pd.cut expects an integer bin count.
batches = int(np.ceil(df.shape[0] / batch_size))
df.index = pd.cut(df.index, batches, labels=range(batches))

###########

def myFunc(batch_data: pd.Series):
    # aggregate passes each column of a batch in as a Series.
    return batch_data.nunique()

output1 = df.groupby(df.index).aggregate({'ID': myFunc})
output2 = df.groupby(df.index).aggregate(myFunc)
output3 = df.groupby(df.index).aggregate({'ID': myFunc, 'X': 'std'})
# Output
#print(output1)
   ID
0   2
1   2
2   1

#print(output2)
   ID  X
0   2  1
1   2  1
2   1  1

#print(output3)
   ID    X
0   2  0.0
1   2  0.0
2   1  0.0
Fugitive answered 20/5, 2019 at 10:39 Comment(1)
I like this use of aggregate here. For the example to work, batches should be cast to int. I don't see the need for resetting the dataframe's index, as that could be useful info depending on your use-case. You could rather just use a separate variable index = pd.cut(...) and use that in the call to groupby. – Grade
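
A sketch of the variant suggested in that comment (keeping the original index and grouping by a separate key; same toy data as in the answer):

import pandas as pd
import numpy as np

df = pd.DataFrame({'ID': [1, 1, 2, 2, 2, 3, 3, 3, 3], 'X': 1})
batch_size = 3

batches = int(np.ceil(df.shape[0] / batch_size))
# Keep df.index intact; build a separate grouping key instead.
batch_key = pd.cut(df.index, batches, labels=range(batches))

output = df.groupby(batch_key).aggregate({'ID': 'nunique'})
print(output)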
