How to keep original index of a DataFrame after groupby 2 columns?

S

4

22

Is there any way I can retain the original index of my large dataframe after I perform a groupby? The reason I need to this is because I need to do an inner merge back to my original df (after my groupby) to regain those lost columns. And the index value is the only 'unique' column to perform the merge back into. Does anyone know how I can achieve this?

My DataFrame is quite large. My groupby looks like this:

df.groupby(['col1', 'col2']).agg({'col3': 'count'}).reset_index()

This drops my original indexes from my original dataframe, which I want to keep.

Stadiometer answered 11/3, 2018 at 3:31 Comment(2)

When you group data, what index do you want each row to have? It's likely each group will combine many rows in the original dataframe.. Do you, for example, expect a list of indices relating to the group? – Steffi 11/3, 2018 at 3:37

yes! that is what I'm looking for – Stadiometer 11/3, 2018 at 3:40

O

15

I think you are are looking for transform in this situation:

df['count'] = df.groupby(['col1', 'col2'])['col3'].transform('count')

Oscine answered 11/3, 2018 at 4:17 Comment(5)

according to the comments, he wants to know which indices contributed to each group – Steffi 11/3, 2018 at 4:24

This seems to be the most optimal solution out there as of Pandas 0.25.1 – Incredible 28/8, 2019 at 19:17

is there a simple solution like this with performance similar to losing the index without transform? i'm using groups.last() and groups.transform('last') is much slower. pandas 1.3.4 – Boatbill 21/10, 2021 at 10:55

@jayen I don't understand your question here. – Oscine 21/10, 2021 at 12:24

groups.transform('last') is slow. groups.last() is fast. is there a fast way to use groupby and keep the index? i managed to do it by copying the index to a column and then restoring it after the groupby. almost as fast as groups.last() – Boatbill 21/10, 2021 at 14:2

S

21

You can elevate your index to a column via reset_index. Then aggregate your index to a tuple via agg, together with your count aggregation.

Below is a minimal example.

import pandas as pd, numpy as np

df = pd.DataFrame(np.random.randint(0, 4, (50, 5)),
                  index=np.random.randint(0, 4, 50))

df = df.reset_index()

res = df.groupby([0, 1]).agg({2: 'count', 'index': tuple}).reset_index()

#     0  1  2            index
# 0   0  0  4     (2, 0, 0, 2)
# 1   0  1  4     (0, 3, 1, 1)
# 2   0  2  1             (1,)
# 3   0  3  1             (3,)
# 4   1  0  4     (1, 2, 1, 3)
# 5   1  1  2           (1, 3)
# 6   1  2  4     (2, 1, 2, 2)
# 7   1  3  1             (2,)
# 8   2  0  5  (0, 3, 0, 2, 2)
# 9   2  1  2           (0, 2)
# 10  2  2  5  (1, 1, 3, 3, 2)
# 11  2  3  2           (0, 1)
# 12  3  0  4     (0, 3, 3, 3)
# 13  3  1  4     (1, 3, 0, 1)
# 14  3  2  3        (3, 2, 1)
# 15  3  3  4     (3, 3, 2, 1)

Steffi answered 11/3, 2018 at 3:44 Comment(2)

For clarity: .agg({'index': tuple}) works on the 'index' column of df (this is a column after df.reset_index()), and put its values, got when performing the grouping, as a tuple in resulting 'index' column of res DataFrame – Haematite 25/10, 2022 at 16:43

Depending on your DataFrame size, it may be worth using set instead of tuple since indices are unique anyway and their order usually doesn't matter (contrary to the example used in this answer) – Mowbray 26/7, 2023 at 16:32

O

15

I think you are are looking for transform in this situation:

df['count'] = df.groupby(['col1', 'col2'])['col3'].transform('count')

Oscine answered 11/3, 2018 at 4:17 Comment(5)

according to the comments, he wants to know which indices contributed to each group – Steffi 11/3, 2018 at 4:24

This seems to be the most optimal solution out there as of Pandas 0.25.1 – Incredible 28/8, 2019 at 19:17

is there a simple solution like this with performance similar to losing the index without transform? i'm using groups.last() and groups.transform('last') is much slower. pandas 1.3.4 – Boatbill 21/10, 2021 at 10:55

@jayen I don't understand your question here. – Oscine 21/10, 2021 at 12:24

groups.transform('last') is slow. groups.last() is fast. is there a fast way to use groupby and keep the index? i managed to do it by copying the index to a column and then restoring it after the groupby. almost as fast as groups.last() – Boatbill 21/10, 2021 at 14:2

B

0

You should not use 'reset_index()' if you want to keep your original indexes

Biddable answered 11/3, 2018 at 3:34 Comment(2)

that doesn't work, even if the reset_index() is not there, the groupby does not retain the original indexes – Stadiometer 11/3, 2018 at 3:35

you are correct, it won't solve the problem. My bad. Let me see if I can find any solution. – Biddable 11/3, 2018 at 4:0

P

0

To actually get the index, you need to do

df['count'] = df.groupby(['col1', 'col2'])['col3'].transform('idxmin') # for first occurrence, idxmax for last occurrence

N.B if your agg column is a datetime, you may get dates instead of the integer index: reference. issue with older versions of pandas.

Workaround this by converting the datetime column to int

df['date_col'] = df['date_col'].apply(lambda x:x.toordinal())

Phalarope answered 1/11, 2022 at 17:59 Comment(0)

Recommended topics

Hot tags