Keep other columns when doing groupby [duplicate]
Asked Answered
D

5

101

I'm using groupby on a pandas dataframe to drop all rows that don't have the minimum of a specific column. Something like this:

df1 = df.groupby("item", as_index=False)["diff"].min()

However, if I have more than those two columns, the other columns (e.g. otherstuff in my example) get dropped. Can I keep those columns using groupby, or am I going to have to find a different way to drop the rows?

My data looks like:

    item    diff   otherstuff
   0   1       2            1
   1   1       1            2
   2   1       3            7
   3   2      -1            0
   4   2       1            3
   5   2       4            9
   6   2      -6            2
   7   3       0            0
   8   3       2            9

and should end up like:

    item   diff  otherstuff
   0   1      1           2
   1   2     -6           2
   2   3      0           0

but what I'm getting is:

    item   diff
   0   1      1           
   1   2     -6           
   2   3      0                 

I've been looking through the documentation and can't find anything. I tried:

df1 = df.groupby(["item", "otherstuff"], as_index=false)["diff"].min()

df1 = df.groupby("item", as_index=false)["diff"].min()["otherstuff"]

df1 = df.groupby("item", as_index=false)["otherstuff", "diff"].min()

But none of those work.

Deportee answered 30/4, 2014 at 17:29 Comment(1)
So you want to keep the entire row which contains the minimum of 'diff' column, including whatever values the other columns are. Yes this isn't covered by pandas doc but should be, please raise a docbug.Preuss
K
175

Method #1: use idxmin() to get the indices of the elements of minimum diff, and then select those:

>>> df.loc[df.groupby("item")["diff"].idxmin()]
   item  diff  otherstuff
1     1     1           2
6     2    -6           2
7     3     0           0

[3 rows x 3 columns]

Method #2: sort by diff, and then take the first element in each item group:

>>> df.sort_values("diff").groupby("item", as_index=False).first()
   item  diff  otherstuff
0     1     1           2
1     2    -6           2
2     3     0           0

[3 rows x 3 columns]

Note that the resulting indices are different even though the row content is the same.

Koren answered 30/4, 2014 at 17:43 Comment(3)
The first returns all NaN, but the second method works. Thank you!Deportee
Thank you for providing two ways of doing this. I have found multiple instances that one works when the other does not and visa versa. Seems like it depends on if your elements are numerical or not. Good to have both.Salsify
Thanks for this! I found the first one to be very slow and incorrect, but the second one was fast and did just what I wanted. I was using multiple indices simultaneously, and that may have messed with things.Conceptionconceptual
A
15

You can use DataFrame.sort_values with DataFrame.drop_duplicates:

df = df.sort_values(by='diff').drop_duplicates(subset='item')
print (df)
   item  diff  otherstuff
6     2    -6           2
7     3     0           0
1     1     1           2

If possible multiple minimal values per groups and want all min rows use boolean indexing with transform for minimal values per groups:

print (df)
   item  diff  otherstuff
0     1     2           1
1     1     1           2 <-multiple min
2     1     1           7 <-multiple min
3     2    -1           0
4     2     1           3
5     2     4           9
6     2    -6           2
7     3     0           0
8     3     2           9

print (df.groupby("item")["diff"].transform('min'))
0    1
1    1
2    1
3   -6
4   -6
5   -6
6   -6
7    0
8    0
Name: diff, dtype: int64

df = df[df.groupby("item")["diff"].transform('min') == df['diff']]
print (df)
   item  diff  otherstuff
1     1     1           2
2     1     1           7
6     2    -6           2
7     3     0           0
Alicia answered 8/1, 2019 at 7:43 Comment(0)
M
1

The above answer worked great if there is / you want one min. In my case there could be multiple mins and I wanted all rows equal to min which .idxmin() doesn't give you. This worked

def filter_group(dfg, col):
    return dfg[dfg[col] == dfg[col].min()]

df = pd.DataFrame({'g': ['a'] * 6 + ['b'] * 6, 'v1': (list(range(3)) + list(range(3))) * 2, 'v2': range(12)})
df.groupby('g',group_keys=False).apply(lambda x: filter_group(x,'v1'))

As an aside, .filter() is also relevant to this question but didn't work for me.

Motoneuron answered 25/3, 2018 at 13:4 Comment(0)
F
1

I tried everyone's method and I couldn't get it to work properly. Instead I did the process step-by-step and ended up with the correct result.

df.sort_values(by='item', inplace=True, ignore_index=True)
df.drop_duplicates(subset='diff', inplace=True, ignore_index=True)
df.sort_values(by=['diff'], inplace=True, ignore_index=True)

For a little more explanation:

  1. Sort items by the minimum value you want
  2. Drop the duplicates of the column you want to sort with
  3. Resort the data because the data is still sorted by the minimum values
Foliole answered 18/6, 2021 at 1:25 Comment(0)
A
-1

If you know that all of your "items" have more than one record you can sort, then use duplicated:

df.sort_values(by='diff').duplicated(subset='item', keep='first')
Aurangzeb answered 17/9, 2018 at 22:1 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.