Pandas apply but only for rows where a condition is met

5

84

I would like to use Pandas df.apply but only for certain rows

As an example, I want to do something like this, but my actual issue is a little more complicated:

import pandas as pd
import math
z = pd.DataFrame({'a':[4.0,5.0,6.0,7.0,8.0],'b':[6.0,0,5.0,0,1.0]})
z.where(z['b'] != 0, z['a'] / z['b'].apply(lambda l: math.log(l)), 0)

What I want in this example is the value in 'a' divided by the log of the value in 'b' for each row, and for rows where 'b' is 0, I simply want to return 0.
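
One likely reason the snippet above fails, sketched under the assumption that the error comes from math.log and not from where() itself: the argument passed to where() is built first, so apply() runs the lambda on every value of 'b', zeros included, before any filtering happens. The flag name `raised` is only for illustration.

```python
import math
import pandas as pd

z = pd.DataFrame({'a': [4.0, 5.0, 6.0, 7.0, 8.0],
                  'b': [6.0, 0, 5.0, 0, 1.0]})

# The log is computed for the whole column up front, so math.log sees b == 0.
try:
    z['b'].apply(lambda l: math.log(l))
    raised = False
except ValueError:
    raised = True  # math.log rejects non-positive input
```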

Hach answered 18/11, 2015 at 0:43 Comment(0)
103

The other answers are excellent, but I thought I'd add one other approach that can be faster in some circumstances – using broadcasting and masking to achieve the same result:

import numpy as np

mask = (z['b'] != 0)
z_valid = z[mask]

z['c'] = 0.0  # float default, so the column dtype matches the values assigned below
z.loc[mask, 'c'] = z_valid['a'] / np.log(z_valid['b'])

Especially with very large dataframes, this approach will generally be faster than solutions based on apply().
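
A minimal, self-contained sketch of the same masking idea, generalized to an arbitrary vectorized function. The helper name `ratio_to_log` is hypothetical (it is not part of pandas or NumPy), and the mask here also excludes b == 1, an added assumption that avoids dividing by log(1) = 0:

```python
import numpy as np
import pandas as pd

z = pd.DataFrame({'a': [4.0, 5.0, 6.0, 7.0, 8.0],
                  'b': [6.0, 0.0, 5.0, 0.0, 1.0]})

def ratio_to_log(a, b):
    """Hypothetical helper: element-wise a / log(b) on Series or arrays."""
    return a / np.log(b)

# Rows where the computation is well defined (b > 0 and log(b) != 0).
mask = (z['b'] > 0) & (z['b'] != 1)

z['c'] = 0.0  # float default for rows outside the mask
# The right-hand side keeps the original index, so pandas aligns it row by row.
z.loc[mask, 'c'] = ratio_to_log(z.loc[mask, 'a'], z.loc[mask, 'b'])
```

Any function that accepts and returns whole Series or arrays can stand in for the helper; a function that only handles scalars would still need apply().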

Unbridled answered 18/11, 2015 at 1:47 Comment(5)
So this mask masks out the values that you don't want. In this case, you are "selecting" those z values that are not zero. Is that correct? - Drinkwater
It's a boolean mask that selects just the nonzero values. You can read more here: jakevdp.github.io/PythonDataScienceHandbook/… - Unbridled
What if we need to apply a function to a large dataframe? Could we do something like z.loc[mask, 'c'] = func(z_valid['a'], z_valid['b'])? - Romeoromeon
For this approach, does the mask selection run twice? For example, in z[mask] and in z.loc[mask, 'c']. - Brann
You can use this answer like a SQL update with a where clause: update table set table.c = table.a / table.b where table.b is not null. Except you define the "where" clause first. - Outstanding
59

You can use a conditional expression inside a lambda function and apply it row by row.

z['c'] = z.apply(lambda row: 0 if row['b'] in (0,1) else row['a'] / math.log(row['b']), axis=1)

I also excluded 1, because log(1) is zero and the expression would otherwise divide by zero.

Output:

     a    b         c
0  4.0  6.0  2.232443
1  5.0  0.0  0.000000
2  6.0  5.0  3.728010
3  7.0  0.0  0.000000
4  8.0  1.0  0.000000
Blameful answered 18/11, 2015 at 1:13 Comment(3)
I know I'm late to the game here, but why do you need to specify axis=1? Isn't it specified in the syntax? And why axis=1 rather than 0? - Koa
@Koa see "axis": pandas.pydata.org/pandas-docs/stable/generated/… - Blameful
axis=1 needs to be a parameter of the apply() function, not of the lambda. You might need extra brackets around your lambda function. - Janniejanos
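
On the axis question raised in the comments above, a small sketch may help. It assumes a two-row frame made up for illustration; the variable names `per_column` and `per_row` are not from the answer.

```python
import pandas as pd

z = pd.DataFrame({'a': [4.0, 5.0], 'b': [6.0, 0.0]})

# axis=0 (the default): the function receives each column as a Series.
per_column = z.apply(lambda col: col.sum())  # indexed by 'a', 'b'

# axis=1: the function receives each row as a Series, so row['a'] and
# row['b'] are scalars taken from the same row.
per_row = z.apply(lambda row: row['a'] + row['b'], axis=1)  # indexed by 0, 1
```

Because the lambda in the answer reads both row['a'] and row['b'], it needs the row-wise form, and axis=1 has to be passed to apply() itself.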
18

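Hope this helps. A conditional lambda on column 'b' is easy and readable. Note that, as written, it computes only log(b) (with 0 where b is 0), not the full a / log(b) from the question:

z['c'] = z['b'].apply(lambda x: 0 if x == 0 else math.log(x))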
Ophthalmia answered 18/11, 2015 at 9:27 Comment(0)
6

You can use a lambda with a conditional that returns 0 when 'b' is not positive, and skip the whole where clause. As written, this computes log(b) rather than a / log(b):

z['c'] = z.apply(lambda x: math.log(x.b) if x.b > 0 else 0, axis=1)

You also have to assign the results to a new column (z['c']).

Coronagraph answered 18/11, 2015 at 1:13 Comment(0)
1

Use np.where(), which returns a divided by the log of b where the condition is met and 0 otherwise. Note that b = 1 passes the condition, so that row is divided by log(1) = 0 and comes out as inf:

import numpy as np
z['c'] = np.where(z['b'] != 0, z['a'] / np.log(z['b']), 0)

Output:

     a    b         c
0  4.0  6.0  2.232443
1  5.0  0.0  0.000000
2  6.0  5.0  3.728010
3  7.0  0.0  0.000000
4  8.0  1.0       inf
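
np.where() computes both branches for every row before choosing between them, so the log of the zero rows is still evaluated and NumPy may emit runtime warnings. A hedged variant, assuming you also want 0 for b == 1 (the name `valid` is only for this sketch):

```python
import numpy as np
import pandas as pd

z = pd.DataFrame({'a': [4.0, 5.0, 6.0, 7.0, 8.0],
                  'b': [6.0, 0.0, 5.0, 0.0, 1.0]})

# Condition chosen for this sketch: b positive and log(b) nonzero.
valid = (z['b'] > 0) & (z['b'] != 1)

# Both branches are evaluated for all rows, so silence the divide-by-zero
# and invalid-value warnings coming from rows that end up discarded.
with np.errstate(divide='ignore', invalid='ignore'):
    z['c'] = np.where(valid, z['a'] / np.log(z['b']), 0.0)
```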
Illustrate answered 24/5, 2022 at 19:58 Comment(0)
