What is the best way to filter groups by two lambda conditions and create a new column based on the conditions?

This is my DataFrame:

import pandas as pd

df = pd.DataFrame(
    {
        'a': ['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'z', 'z', 'z', 'p', 'p', 'p', 'p'],
        'b': [1, -1, 1, 1, -1, 1, 1, -1, -1, -1, -1, 1, 1, 1, 1]
    }
)

And this is the expected output. I want to create column c:

    a  b    c
0   x  1    first
1   x -1    first
2   x  1    first
3   x  1    first
4   y -1    second
5   y  1    second
6   y  1    second
7   y -1    second
11  p  1    first
12  p  1    first
13  p  1    first
14  p  1    first

Groups are defined by column a. I want to filter df and keep only the groups whose first or second b value is 1.

I did this with this code:

df1 = df.groupby('a').filter(lambda x: (x.b.iloc[0] == 1) | (x.b.iloc[1] == 1))

To create column c for df1, groups are again defined by a: if a group's first b is 1, then c should be first; otherwise, if its second b is 1, then c should be second.

Note that for group p, both the first and second b are 1; for such groups I want c to be first.

Maybe the way that I approach the issue is totally wrong.

Electrolier answered 5/3 at 4:56 Comment(1)
Sorry, missed that part. – Crosstie

A generic method that works with any number of positions for the first 1:

d = {0: 'first', 1: 'second'}

s = (df.groupby('a')['b']
       .transform(lambda g: g.reset_index()[g.values==1]
                  .first_valid_index())
       .replace(d)
     )

out = df.assign(c=s).dropna(subset=['c'])

Notes:

  • if you remove the replace step, c will hold the numeric position of the first 1 instead of a label
  • if you use map in place of replace, positions that are not dictionary keys become NaN, so the dropna step also drops those groups
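The difference is easiest to see on a small stand-in for the transformed positions (hypothetical values; position 2 represents a group whose first 1 is in its third row):

```python
import pandas as pd

pos = pd.Series([0, 1, 2])      # first-1 position per row, as produced by the transform
d = {0: 'first', 1: 'second'}

# replace leaves unmatched values untouched, so position 2 survives
print(pos.replace(d).tolist())  # ['first', 'second', 2]

# map turns unmatched values into NaN, so a later dropna removes them
print(pos.map(d).isna().tolist())  # [False, False, True]
```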

Output:

    a  b       c
0   x  1   first
1   x -1   first
2   x  1   first
3   x  1   first
4   y -1  second
5   y  1  second
6   y  1  second
7   y -1  second
11  p  1   first
12  p  1   first
13  p  1   first
14  p  1   first

Example from comments:

df = pd.DataFrame({'a': ['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'z', 'z', 'z', 'p', 'p', 'p', 'p'],
                  'b': [1, -1, 1, 1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1]})

d = {0: 'first', 1: 'second'}

s = (df.groupby('a')['b']
       .transform(lambda g: g.reset_index()[g.values==1]
                  .first_valid_index())
       .map(d)
     )

out = df.assign(c=s).dropna(subset=['c'])

    a  b       c
0   x  1   first
1   x -1   first
2   x  1   first
3   x  1   first
4   y -1  second
5   y  1  second
6   y  1  second
7   y -1  second
11  p  1   first
12  p  1   first
13  p  1   first
14  p  1   first

You can also only filter the rows with:

m1 = df.groupby('a').cumcount().le(1)
m2 = df['b'].eq(1)
out = df.loc[df['a'].isin(df.loc[m1&m2, 'a'])]
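A quick sanity check of this mask approach on the DataFrame from the comments, where group z has a 1 only in its third row and must therefore be dropped:

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'z', 'z', 'z', 'p', 'p', 'p', 'p'],
                   'b': [1, -1, 1, 1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1]})

m1 = df.groupby('a').cumcount().le(1)   # True for the first two rows of each group
m2 = df['b'].eq(1)                      # True where b is 1
out = df.loc[df['a'].isin(df.loc[m1 & m2, 'a'])]

print(sorted(out['a'].unique()))        # ['p', 'x', 'y']: z's 1 comes too late
```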
Ecclesiasticism answered 5/3 at 6:25 Comment(4)
Thanks a lot. Note that I don't want to keep a group if it has a 1 anywhere other than its first or second row. For example, for this df I don't want to keep group z: df = pd.DataFrame({'a': ['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'z', 'z', 'z', 'p', 'p', 'p', 'p'], 'b': [1, -1, 1, 1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1]}) – Electrolier
@x_Amir_x then use map in place of replace ;) – Ecclesiasticism
Ah. My bad :) I want to accept your answer. Do you think it is the best one? I've got other answers too. – Electrolier
@x_Amir_x it's up to you, I'm adding another method if you only want to filter – Ecclesiasticism

I think transform could also help in this case:

df["c"] = df.groupby("a")["b"].transform(
    lambda x: "first" if x.iloc[0] == 1 else ("second" if x.iloc[1] == 1 else None)
)
df = df.dropna()  # dropna returns a new DataFrame, so assign it back

Output:

    a   b   c
0   x   1   first
1   x   -1  first
2   x   1   first
3   x   1   first
4   y   -1  second
5   y   1   second
6   y   1   second
7   y   -1  second
11  p   1   first
12  p   1   first
13  p   1   first
14  p   1   first

1.09 ms ± 119 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)

If you want to do everything in a single line:

df = df.assign(c=df.groupby("a")["b"].transform(lambda x: "first" if x.iloc[0] == 1 else ("second" if x.iloc[1] == 1 else None))).dropna()

But this increases the time to 1.24 ms ± 357 µs per loop (mean ± std. dev. of 10 runs, 10 loops each).

Xiomaraxiong answered 5/3 at 5:52 Comment(3)
No explicit filtering required! – Xiomaraxiong
Using lambda functions with groupby transform for boolean masking is not recommended when dealing with a large number of groups, as it can significantly slow down the process. Of course this is not an issue with smaller group counts, such as the example with 4 groups. I also think that easy-to-understand code is a good answer. – Therapeutic
Thanks for adding more info to my answer. Will keep that in mind! – Xiomaraxiong

Use GroupBy.cumcount for a per-group counter, keep only the rows where b is 1, map positions 0 and 1 to first and second with Series.map, drop the unmatched rows with Series.dropna, attach column a with DataFrame.join, remove duplicates, and join the result back to the original DataFrame with DataFrame.merge:

s = df.groupby('a').cumcount()[df['b'].eq(1)].map({0: 'first', 1: 'second'}).dropna()

out = (df.merge(s.to_frame('c').join(df.a).drop_duplicates('a'), how='left')
         .dropna(subset=['c']))
print(out)
    a  b       c
0   x  1   first
1   x -1   first
2   x  1   first
3   x  1   first
4   y -1  second
5   y  1  second
6   y  1  second
7   y -1  second
11  p  1   first
12  p  1   first
13  p  1   first
14  p  1   first  

Another idea:

s = (df.assign(g = df.groupby('a').cumcount())[df['b'].eq(1)]
       .drop_duplicates('a').set_index('a')['g'])

out = df.assign(c = df['a'].map(s.map({0: 'first', 1: 'second'}))).dropna(subset=['c'])
print(out)
    a  b       c
0   x  1   first
1   x -1   first
2   x  1   first
3   x  1   first
4   y -1  second
5   y  1  second
6   y  1  second
7   y -1  second
11  p  1   first
12  p  1   first
13  p  1   first
14  p  1   first
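To sanity-check the intermediate Series s from the second idea: on the sample df it maps each qualifying group to the position of its first 1 (a quick check, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'z', 'z', 'z', 'p', 'p', 'p', 'p'],
                   'b': [1, -1, 1, 1, -1, 1, 1, -1, -1, -1, -1, 1, 1, 1, 1]})

# Tag each row with its within-group position, keep rows where b == 1,
# then take the first such row per group.
s = (df.assign(g=df.groupby('a').cumcount())[df['b'].eq(1)]
       .drop_duplicates('a').set_index('a')['g'])

print(s.to_dict())  # {'x': 0, 'y': 1, 'p': 0}; z never appears, so it gets dropped
```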
Wells answered 5/3 at 6:40 Comment(0)

Code

Use GroupBy.nth:

g = df.groupby(['a'])['b']
first = g.nth(0)[lambda x: x.eq(1)].replace(1, 'first')
second = g.nth(1)[lambda x: x.eq(1)].replace(1, 'second')
m = {**second, **first}  # first takes precedence for groups where both are 1
df['c'] = df['a'].map(m)

df

    a   b   c
0   x   1   first
1   x   -1  first
2   x   1   first
3   x   1   first
4   y   -1  second
5   y   1   second
6   y   1   second
7   y   -1  second
8   z   -1  NaN
9   z   -1  NaN
10  z   -1  NaN
11  p   1   first
12  p   1   first
13  p   1   first
14  p   1   first

filter:

out = df[df['c'].notna()]
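One caveat (my addition, not from the answer): in pandas 2.0+, GroupBy.nth behaves as a filter and returns rows with their original index rather than the group key, so the dict m above would end up keyed by row numbers. Setting a as the index first keeps the group key in either case; a sketch on the sample df:

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'z', 'z', 'z', 'p', 'p', 'p', 'p'],
                   'b': [1, -1, 1, 1, -1, 1, 1, -1, -1, -1, -1, 1, 1, 1, 1]})

# With 'a' as the index, nth yields Series indexed by the group key whether
# it acts as a reducer (older pandas) or a filter (pandas 2.0+).
g = df.set_index('a').groupby(level=0)['b']
first = g.nth(0)[lambda x: x.eq(1)].replace(1, 'first')
second = g.nth(1)[lambda x: x.eq(1)].replace(1, 'second')
m = {**second, **first}        # first wins for groups where both are 1

df['c'] = df['a'].map(m)
out = df[df['c'].notna()]
```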
Therapeutic answered 5/3 at 5:26 Comment(0)

While pandas' idxmax() returns the index label, argmax() returns the position instead.

(df.assign(c=df['b'].where(df['b'].eq(1))
               .groupby(df['a'])
               .transform(lambda x: x.argmax())
               .map(dict(enumerate(['first', 'second']))))
   .dropna(subset='c'))

Output:

    a  b       c
0   x  1   first
1   x -1   first
2   x  1   first
3   x  1   first
4   y -1  second
5   y  1  second
6   y  1  second
7   y -1  second
11  p  1   first
12  p  1   first
13  p  1   first
14  p  1   first
Jack answered 5/3 at 15:25 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.