What is the best way to filter groups by two lambda conditions and create a new column based on the conditions?

This is my DataFrame:

import pandas as pd

df = pd.DataFrame(
    {
        'a': ['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'z', 'z', 'z', 'p', 'p', 'p', 'p'],
        'b': [1, -1, 1, 1, -1, 1, 1, -1, -1, -1, -1, 1, 1, 1, 1]
    }
)

And this is the expected output. I want to create column c:

    a  b    c
0   x  1    first
1   x -1    first
2   x  1    first
3   x  1    first
4   y -1    second
5   y  1    second
6   y  1    second
7   y -1    second
11  p  1    first
12  p  1    first
13  p  1    first
14  p  1    first

Groups are defined by column a. I want to filter df and keep only the groups whose first or second b value is 1.

I did this with this code:

df1 = df.groupby('a').filter(lambda x: (x.b.iloc[0] == 1) | (x.b.iloc[1] == 1))

To create column c for df1, groups are again defined by a: if a group's first b is 1, then c should be first; otherwise, if its second b is 1, then c should be second.

Note that for group p, both the first and second b are 1; for such groups I want c to be first.

Maybe the way that I approach the issue is totally wrong.

Electrolier answered 5/3 at 4:56 Comment(1)
Sorry, missed that part. – Crosstie

A generic method that works with any number of positions for the first 1:

d = {0: 'first', 1: 'second'}

s = (df.groupby('a')['b']
       .transform(lambda g: g.reset_index()[g.values==1]
                  .first_valid_index())
       .replace(d)
     )

out = df.assign(c=s).dropna(subset=['c'])

Notes:

  • if you remove the replace step, c will hold the numeric position of the first 1 instead of a label
  • if you use map in place of replace, positions that are not dictionary keys become NaN, so the dropna step also drops those groups
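The difference is easiest to see on a small stand-in for the transformed positions (hypothetical values; position 2 represents a group whose first 1 is in its third row):

```python
import pandas as pd

pos = pd.Series([0, 1, 2])      # first-1 position per row, as produced by the transform
d = {0: 'first', 1: 'second'}

# replace leaves unmatched values untouched, so position 2 survives
print(pos.replace(d).tolist())  # ['first', 'second', 2]

# map turns unmatched values into NaN, so a later dropna removes them
print(pos.map(d).isna().tolist())  # [False, False, True]
```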

Output:

    a  b       c
0   x  1   first
1   x -1   first
2   x  1   first
3   x  1   first
4   y -1  second
5   y  1  second
6   y  1  second
7   y -1  second
11  p  1   first
12  p  1   first
13  p  1   first
14  p  1   first

Example from comments:

df = pd.DataFrame({'a': ['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'z', 'z', 'z', 'p', 'p', 'p', 'p'],
                  'b': [1, -1, 1, 1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1]})

d = {0: 'first', 1: 'second'}

s = (df.groupby('a')['b']
       .transform(lambda g: g.reset_index()[g.values==1]
                  .first_valid_index())
       .map(d)
     )

out = df.assign(c=s).dropna(subset=['c'])

    a  b       c
0   x  1   first
1   x -1   first
2   x  1   first
3   x  1   first
4   y -1  second
5   y  1  second
6   y  1  second
7   y -1  second
11  p  1   first
12  p  1   first
13  p  1   first
14  p  1   first

You can also only filter the rows with:

m1 = df.groupby('a').cumcount().le(1)
m2 = df['b'].eq(1)
out = df.loc[df['a'].isin(df.loc[m1&m2, 'a'])]
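A quick sanity check of this mask approach on the DataFrame from the comments, where group z has a 1 only in its third row and must therefore be dropped:

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'z', 'z', 'z', 'p', 'p', 'p', 'p'],
                   'b': [1, -1, 1, 1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1]})

m1 = df.groupby('a').cumcount().le(1)   # True for the first two rows of each group
m2 = df['b'].eq(1)                      # True where b is 1
out = df.loc[df['a'].isin(df.loc[m1 & m2, 'a'])]

print(sorted(out['a'].unique()))        # ['p', 'x', 'y']: z's 1 comes too late
```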
Ecclesiasticism answered 5/3 at 6:25 Comment(4)
Thanks a lot. Note that I don't want to keep a group if it has a 1 anywhere other than its first or second row. For example, for this df I don't want to keep group z: df = pd.DataFrame({'a': ['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'z', 'z', 'z', 'p', 'p', 'p', 'p'], 'b': [1, -1, 1, 1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1]}) – Electrolier
@x_Amir_x then use map in place of replace ;) – Ecclesiasticism
Ah. My bad :) I want to accept your answer. Do you think it is the best one? I've got other answers too. – Electrolier
@x_Amir_x it's up to you, I'm adding another method if you only want to filter – Ecclesiasticism

I think transform could also help in this case:

df["c"] = df.groupby("a")["b"].transform(
    lambda x: "first" if x.iloc[0] == 1 else ("second" if x.iloc[1] == 1 else None)
)
df = df.dropna()  # dropna returns a new DataFrame, so assign it back

Output:

    a   b   c
0   x   1   first
1   x   -1  first
2   x   1   first
3   x   1   first
4   y   -1  second
5   y   1   second
6   y   1   second
7   y   -1  second
11  p   1   first
12  p   1   first
13  p   1   first
14  p   1   first

1.09 ms ± 119 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)

If you want to do everything in a single line:

df = df.assign(c=df.groupby("a")["b"].transform(lambda x: "first" if x.iloc[0] == 1 else ("second" if x.iloc[1] == 1 else None))).dropna()

But this increases the time to 1.24 ms ± 357 µs per loop (mean ± std. dev. of 10 runs, 10 loops each).

Xiomaraxiong answered 5/3 at 5:52 Comment(3)
No explicit filtering required! – Xiomaraxiong
Using lambda functions with groupby transform for boolean masking is not recommended when dealing with a large number of groups, as it can significantly slow down the process. Of course this is not an issue with smaller group counts, such as the example with 4 groups. I also think that easy-to-understand code is a good answer. – Therapeutic
Thanks for adding more info to my answer. Will keep that in mind! – Xiomaraxiong

Use GroupBy.cumcount for a per-group counter, keep only the rows where b is 1, map positions 0 and 1 to first and second with Series.map, drop the unmatched rows with Series.dropna, attach column a with DataFrame.join, remove duplicates, and join the result back to the original DataFrame with DataFrame.merge:

s = df.groupby('a').cumcount()[df['b'].eq(1)].map({0: 'first', 1: 'second'}).dropna()

out = (df.merge(s.to_frame('c').join(df.a).drop_duplicates('a'), how='left')
         .dropna(subset=['c']))
print(out)
    a  b       c
0   x  1   first
1   x -1   first
2   x  1   first
3   x  1   first
4   y -1  second
5   y  1  second
6   y  1  second
7   y -1  second
11  p  1   first
12  p  1   first
13  p  1   first
14  p  1   first  

Another idea:

s = (df.assign(g = df.groupby('a').cumcount())[df['b'].eq(1)]
       .drop_duplicates('a').set_index('a')['g'])

out = df.assign(c = df['a'].map(s.map({0: 'first', 1: 'second'}))).dropna(subset=['c'])
print(out)
    a  b       c
0   x  1   first
1   x -1   first
2   x  1   first
3   x  1   first
4   y -1  second
5   y  1  second
6   y  1  second
7   y -1  second
11  p  1   first
12  p  1   first
13  p  1   first
14  p  1   first
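To sanity-check the intermediate Series s from the second idea: on the sample df it maps each qualifying group to the position of its first 1 (a quick check, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'z', 'z', 'z', 'p', 'p', 'p', 'p'],
                   'b': [1, -1, 1, 1, -1, 1, 1, -1, -1, -1, -1, 1, 1, 1, 1]})

# Tag each row with its within-group position, keep rows where b == 1,
# then take the first such row per group.
s = (df.assign(g=df.groupby('a').cumcount())[df['b'].eq(1)]
       .drop_duplicates('a').set_index('a')['g'])

print(s.to_dict())  # {'x': 0, 'y': 1, 'p': 0}; z never appears, so it gets dropped
```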
Wells answered 5/3 at 6:40 Comment(0)

Code

Use GroupBy.nth:

g = df.groupby(['a'])['b']
first = g.nth(0)[lambda x: x.eq(1)].replace(1, 'first')
second = g.nth(1)[lambda x: x.eq(1)].replace(1, 'second')
m = {**second, **first}  # first takes precedence for groups where both are 1
df['c'] = df['a'].map(m)

df

    a   b   c
0   x   1   first
1   x   -1  first
2   x   1   first
3   x   1   first
4   y   -1  second
5   y   1   second
6   y   1   second
7   y   -1  second
8   z   -1  NaN
9   z   -1  NaN
10  z   -1  NaN
11  p   1   first
12  p   1   first
13  p   1   first
14  p   1   first

filter:

out = df[df['c'].notna()]
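One caveat (my addition, not from the answer): in pandas 2.0+, GroupBy.nth behaves as a filter and returns rows with their original index rather than the group key, so the dict m above would end up keyed by row numbers. Setting a as the index first keeps the group key in either case; a sketch on the sample df:

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'z', 'z', 'z', 'p', 'p', 'p', 'p'],
                   'b': [1, -1, 1, 1, -1, 1, 1, -1, -1, -1, -1, 1, 1, 1, 1]})

# With 'a' as the index, nth yields Series indexed by the group key whether
# it acts as a reducer (older pandas) or a filter (pandas 2.0+).
g = df.set_index('a').groupby(level=0)['b']
first = g.nth(0)[lambda x: x.eq(1)].replace(1, 'first')
second = g.nth(1)[lambda x: x.eq(1)].replace(1, 'second')
m = {**second, **first}        # first wins for groups where both are 1

df['c'] = df['a'].map(m)
out = df[df['c'].notna()]
```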
Therapeutic answered 5/3 at 5:26 Comment(0)

While pandas' idxmax() returns the index label, argmax() returns the position instead.

(df.assign(c=df['b'].where(df['b'].eq(1))
               .groupby(df['a'])
               .transform(lambda x: x.argmax())
               .map(dict(enumerate(['first', 'second']))))
   .dropna(subset='c'))

Output:

    a  b       c
0   x  1   first
1   x -1   first
2   x  1   first
3   x  1   first
4   y -1  second
5   y  1  second
6   y  1  second
7   y -1  second
11  p  1   first
12  p  1   first
13  p  1   first
14  p  1   first
Jack answered 5/3 at 15:25 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.