How to keep original index of a DataFrame after groupby 2 columns?
Asked Answered
S

4

22

Is there any way I can retain the original index of my large dataframe after I perform a groupby? The reason I need to this is because I need to do an inner merge back to my original df (after my groupby) to regain those lost columns. And the index value is the only 'unique' column to perform the merge back into. Does anyone know how I can achieve this?

My DataFrame is quite large. My groupby looks like this:

df.groupby(['col1', 'col2']).agg({'col3': 'count'}).reset_index()

This drops my original indexes from my original dataframe, which I want to keep.

Stadiometer answered 11/3, 2018 at 3:31 Comment(2)
When you group data, what index do you want each row to have? It's likely each group will combine many rows in the original dataframe.. Do you, for example, expect a list of indices relating to the group?Steffi
yes! that is what I'm looking forStadiometer
O
15

I think you are are looking for transform in this situation:

df['count'] = df.groupby(['col1', 'col2'])['col3'].transform('count')
Oscine answered 11/3, 2018 at 4:17 Comment(5)
according to the comments, he wants to know which indices contributed to each groupSteffi
This seems to be the most optimal solution out there as of Pandas 0.25.1Incredible
is there a simple solution like this with performance similar to losing the index without transform? i'm using groups.last() and groups.transform('last') is much slower. pandas 1.3.4Boatbill
@jayen I don't understand your question here.Oscine
groups.transform('last') is slow. groups.last() is fast. is there a fast way to use groupby and keep the index? i managed to do it by copying the index to a column and then restoring it after the groupby. almost as fast as groups.last()Boatbill
S
21

You can elevate your index to a column via reset_index. Then aggregate your index to a tuple via agg, together with your count aggregation.

Below is a minimal example.

import pandas as pd, numpy as np

df = pd.DataFrame(np.random.randint(0, 4, (50, 5)),
                  index=np.random.randint(0, 4, 50))

df = df.reset_index()

res = df.groupby([0, 1]).agg({2: 'count', 'index': tuple}).reset_index()

#     0  1  2            index
# 0   0  0  4     (2, 0, 0, 2)
# 1   0  1  4     (0, 3, 1, 1)
# 2   0  2  1             (1,)
# 3   0  3  1             (3,)
# 4   1  0  4     (1, 2, 1, 3)
# 5   1  1  2           (1, 3)
# 6   1  2  4     (2, 1, 2, 2)
# 7   1  3  1             (2,)
# 8   2  0  5  (0, 3, 0, 2, 2)
# 9   2  1  2           (0, 2)
# 10  2  2  5  (1, 1, 3, 3, 2)
# 11  2  3  2           (0, 1)
# 12  3  0  4     (0, 3, 3, 3)
# 13  3  1  4     (1, 3, 0, 1)
# 14  3  2  3        (3, 2, 1)
# 15  3  3  4     (3, 3, 2, 1)
Steffi answered 11/3, 2018 at 3:44 Comment(2)
For clarity: .agg({'index': tuple}) works on the 'index' column of df (this is a column after df.reset_index()), and put its values, got when performing the grouping, as a tuple in resulting 'index' column of res DataFrameHaematite
Depending on your DataFrame size, it may be worth using set instead of tuple since indices are unique anyway and their order usually doesn't matter (contrary to the example used in this answer)Mowbray
O
15

I think you are are looking for transform in this situation:

df['count'] = df.groupby(['col1', 'col2'])['col3'].transform('count')
Oscine answered 11/3, 2018 at 4:17 Comment(5)
according to the comments, he wants to know which indices contributed to each groupSteffi
This seems to be the most optimal solution out there as of Pandas 0.25.1Incredible
is there a simple solution like this with performance similar to losing the index without transform? i'm using groups.last() and groups.transform('last') is much slower. pandas 1.3.4Boatbill
@jayen I don't understand your question here.Oscine
groups.transform('last') is slow. groups.last() is fast. is there a fast way to use groupby and keep the index? i managed to do it by copying the index to a column and then restoring it after the groupby. almost as fast as groups.last()Boatbill
B
0

You should not use 'reset_index()' if you want to keep your original indexes

Biddable answered 11/3, 2018 at 3:34 Comment(2)
that doesn't work, even if the reset_index() is not there, the groupby does not retain the original indexesStadiometer
you are correct, it won't solve the problem. My bad. Let me see if I can find any solution.Biddable
P
0

To actually get the index, you need to do

df['count'] = df.groupby(['col1', 'col2'])['col3'].transform('idxmin') # for first occurrence, idxmax for last occurrence

N.B if your agg column is a datetime, you may get dates instead of the integer index: reference. issue with older versions of pandas.

Workaround this by converting the datetime column to int

df['date_col'] = df['date_col'].apply(lambda x:x.toordinal())
Phalarope answered 1/11, 2022 at 17:59 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.