I have a pandas DataFrame df for which I want to compute some statistics per batch of rows. For example, let's say that I have batch_size = 200000. For each batch of batch_size rows I would like to have the number of unique values in a column ID of my DataFrame. How can I do something like that?
Here is an example of what I want:
print(df)
>>
   ID
0   1
1   1
2   2
3   2
4   2
5   3
6   3
7   3
8   3
batch_size = 3
my_new_function(df, batch_size)
>>
For batch 1 (0 to 2):
2 unique values
1 appears 2 times
2 appears 1 time
For batch 2 (3 to 5):
2 unique values
2 appears 2 times
3 appears 1 time
For batch 3 (6 to 8):
1 unique value
3 appears 3 times
Note: The output can of course be a simple DataFrame rather than printed text.
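To make this concrete, here is a rough sketch of the kind of function I have in mind, assuming the batches are simply consecutive, non-overlapping chunks of batch_size rows taken in order (my_new_function is just the placeholder name from the example above, and the position-based grouping is only one possible way to form the batches):

import pandas as pd

# Example data from the question: nine rows with a single ID column.
df = pd.DataFrame({'ID': [1, 1, 2, 2, 2, 3, 3, 3, 3]})
batch_size = 3

def my_new_function(df, batch_size):
    # Assign each row to a batch by its position: rows 0..batch_size-1 form
    # batch 0, the next batch_size rows form batch 1, and so on.
    batch = pd.Series(range(len(df)), index=df.index) // batch_size
    for batch_id, group in df.groupby(batch):
        counts = group['ID'].value_counts()
        start = batch_id * batch_size
        end = min(start + batch_size, len(df)) - 1
        print(f"For batch {batch_id + 1} ({start} to {end}):")
        print(f"{counts.size} unique value{'s' if counts.size > 1 else ''}")
        for value, count in counts.items():
            print(f"{value} appears {count} time{'s' if count > 1 else ''}")

my_new_function(df, batch_size)

If a DataFrame is preferred over printed text, something like df.groupby(batch)['ID'].nunique() (or .value_counts()) should give the per-batch counts directly, but I have not settled on an output format.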
df_batch.drop_duplicates(subset=['ID']).size(). But that still doesn't answer the question: what do you mean by batch, is it randomly 200000 rows? – Longdrawnoutdf
And the expected output for a smaller batch_size (batch_size=3), for example. – Kedge