Find half of each group with Pandas GroupBy

Asked 27/6, 2017 at 19:42 Answered 27/6, 2017 at 19:54

Solved python pandas pandas-groupby split-apply-combine

I need to select half of a dataframe using the groupby, where the size of each group is unknown and may vary across groups. For example:

       index  summary  participant_id
0     130599     17.0              13
1     130601     18.0              13
2     130603     16.0              13
3     130605     15.0              13
4     130607     15.0              13
5     130609     16.0              13
6     130611     17.0              13
7     130613     15.0              13
8     130615     17.0              13
9     130617     17.0              13
10     86789     12.0              14
11     86791      8.0              14
12     86793     21.0              14
13     86795     19.0              14
14     86797     20.0              14
15     86799      9.0              14
16     86801     10.0              14
20    107370      1.0              15
21    107372      2.0              15
22    107374      2.0              15
23    107376      4.0              15
24    107378      4.0              15
25    107380      7.0              15
26    107382      6.0              15
27    107597      NaN              15
28    107384     14.0              15

The size of groups from groupyby('participant_id') are 10, 7, 9 for participant_id 13, 14, 15 respectively. What I need is to take only the FIRST half (or floor(N/2)) of each group.

From my (very limited) experience with Pandas groupby, it should be something like:

df.groupby('participant_id')[['summary','participant_id']].apply(lambda x: x[:k_i])

where k_i is the half of the size of each group. Is there a simple solution to find the k_i?

Recurvate answered 27/6, 2017 at 19:42 Comment(3)

When you say "take the first half of each group": do you care how the data is ordered? I.e., do you want to first sort by index or summary, and then take the first half of each group? – Vestryman 27/6, 2017 at 19:47

@BradSolomon, you can ignore index, by 'first' I mean take (if there are 10 lines in each group for example) only first 5 lines from each group. Does it help? – Recurvate 27/6, 2017 at 19:51

How is df.groupby('participant_id').apply(lambda x: x[:len(x)//2])? – Petulia 27/6, 2017 at 19:53

IIUC, you can use index slicing with size //2 inside of lambda:

df.groupby('participant_id').apply(lambda x: x.iloc[:x.participant_id.size//2])

Output:

                    index  summary  participant_id
participant_id                                    
13             0   130599     17.0              13
               1   130601     18.0              13
               2   130603     16.0              13
               3   130605     15.0              13
               4   130607     15.0              13
14             10   86789     12.0              14
               11   86791      8.0              14
               12   86793     21.0              14
15             20  107370      1.0              15
               21  107372      2.0              15
               22  107374      2.0              15
               23  107376      4.0              15

Sparerib answered 27/6, 2017 at 19:54 Comment(4)

Nice answer, you may also want to consider setting as_index=False within .groupby(). – Vestryman 27/6, 2017 at 19:57

@BradSolomon, thanks! I am learning Pandas and your solutions guys very neat! – Recurvate 27/6, 2017 at 20:0

@ArnoldKlein or you can add .reset_index(level=0, drop=True) – Sparerib 27/6, 2017 at 20:4

Just an obvious side note, the same can be written using multiplication wrapped with int, like so: df.groupby('participant_id').apply(lambda x: x.iloc[:int(x.participant_id.size*0.5)]) , and in my case, I also wanted a minimum of 1, so need to be careful and use something like: df.groupby('participant_id').apply(lambda x: x.iloc[:max(int(x.participant_id.size*0.5),1)]) – Sirree 10/10, 2020 at 11:16

You could group by participant_id and check whether its index is in the first half with the transform method. This will create a boolean Series. Then use this boolean series to filter out your original dataframe.

criteria = df.groupby('participant_id')['participant_id']\
             .transform(lambda x:  np.arange(len(x)) < int(len(x) / 2))
df[criteria]

     index  summary  participant_id
0   130599     17.0              13
1   130601     18.0              13
2   130603     16.0              13
3   130605     15.0              13
4   130607     15.0              13
10   86789     12.0              14
11   86791      8.0              14
12   86793     21.0              14
20  107370      1.0              15
21  107372      2.0              15
22  107374      2.0              15
23  107376      4.0              15

Anabatic answered 27/6, 2017 at 19:54 Comment(1)

thanks, great answer, but the previous one seems to be simpler. – Recurvate 27/6, 2017 at 20:6

Recommended topics

Hot tags