Since pandas 2.2.0, you can use case_when() on a column. Just initialize the column with the default value and replace values in it using case_when(), which accepts a list of (condition, replacement) tuples. For the example in the OP, we can use the following.
pd_df["difficulty"] = "Unknown"
pd_df["difficulty"] = pd_df["difficulty"].case_when([
    (pd_df.eval("0 < Time < 30"), "Easy"),
    (pd_df.eval("30 <= Time <= 60"), "Medium"),
    (pd_df.eval("Time > 60"), "Hard")
])
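The conditions handed to case_when() are ordinary boolean Series, so the eval() strings are only a shorthand. A minimal sketch, assuming a small made-up frame (only the column name Time comes from the question; the values are illustrative), checks that the three ways of writing the "Easy" condition used on this page give the same mask:

```python
import pandas as pd

# Hypothetical sample data chosen to include the boundary values 0, 30 and 60.
df = pd.DataFrame({"Time": [-5, 0, 10, 30, 60, 61]})

# The same "0 < Time < 30" condition written three ways.
mask_eval = df.eval("0 < Time < 30")
mask_ops = (df["Time"] > 0) & (df["Time"] < 30)
mask_between = df["Time"].between(0, 30, inclusive="neither")

# All three are element-wise boolean masks with identical values.
same = mask_eval.tolist() == mask_ops.tolist() == mask_between.tolist()
print(same)
```

Any of these masks can be used as the condition in a (condition, replacement) tuple; which one to pick is mostly a matter of readability.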
loc
OP's code only needed loc to correctly call the __setitem__() method via []. In particular, they had already used the proper brackets () to evaluate the &-chained conditions individually.
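To see why those parentheses matter, here is a small sketch on a made-up frame (the values are illustrative, not the OP's data). In Python, & binds more tightly than comparison operators, so leaving the parentheses out changes how the expression is parsed:

```python
import pandas as pd

# Hypothetical sample data; only the column name "Time" comes from the question.
df = pd.DataFrame({"Time": [-5, 10, 45, 90]})

# Parenthesized: each comparison is evaluated first, then combined element-wise.
easy = (df["Time"] > 0) & (df["Time"] < 30)

# Unparenthesized: parsed as df["Time"] > (0 & df["Time"]) < 30, a chained
# comparison that asks for the truth value of a whole Series and fails.
try:
    df["Time"] > 0 & df["Time"] < 30
    raised = False
except ValueError:
    raised = True

print(easy.tolist(), raised)
```

The parenthesized mask is what gets passed to loc as the row selector.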
The basic idea of this approach is to initialize a column with some default value (e.g. "Unknown") and then update rows depending on conditions (e.g. "Easy" if 0<Time<30, and so on).
When I timed the options given on this page on a large frame (15 million rows), the loc approach was the fastest (4-5 times faster than np.select and nested np.where); see footnote 1.
pd_df['difficulty'] = 'Unknown'
pd_df.loc[(pd_df['Time']<30) & (pd_df['Time']>0), 'difficulty'] = 'Easy'
pd_df.loc[(pd_df['Time']>=30) & (pd_df['Time']<=60), 'difficulty'] = 'Medium'
pd_df.loc[pd_df['Time']>60, 'difficulty'] = 'Hard'
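As a quick sanity check, the same three assignments can be run on a tiny frame. This is a sketch under assumptions: the sample values and the helper name label_difficulty are made up for illustration, and the values are picked to hit the boundaries 0, 30 and 60.

```python
import pandas as pd

def label_difficulty(frame):
    """Hypothetical helper wrapping the loc-based assignments shown above."""
    frame = frame.copy()  # leave the caller's frame untouched
    frame["difficulty"] = "Unknown"  # default for rows matching no condition
    frame.loc[(frame["Time"] < 30) & (frame["Time"] > 0), "difficulty"] = "Easy"
    frame.loc[(frame["Time"] >= 30) & (frame["Time"] <= 60), "difficulty"] = "Medium"
    frame.loc[frame["Time"] > 60, "difficulty"] = "Hard"
    return frame

sample = pd.DataFrame({"Time": [-5, 0, 10, 30, 60, 61]})
labels = label_difficulty(sample)["difficulty"].tolist()
print(labels)
# ['Unknown', 'Unknown', 'Easy', 'Medium', 'Medium', 'Hard']
```

Rows with Time <= 0 match none of the conditions, so they keep the default "Unknown".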
1: Code used for benchmark.
import numpy as np
import pandas as pd

def loc(pd_df):
    pd_df['difficulty'] = 'Unknown'
    pd_df.loc[(pd_df['Time']<30) & (pd_df['Time']>0), 'difficulty'] = 'Easy'
    pd_df.loc[(pd_df['Time']>=30) & (pd_df['Time']<=60), 'difficulty'] = 'Medium'
    pd_df.loc[pd_df['Time']>60, 'difficulty'] = 'Hard'
    return pd_df

def np_select(pd_df):
    pd_df['difficulty'] = np.select([pd_df['Time'].between(0, 30, inclusive='neither'), pd_df['Time'].between(30, 60, inclusive='both'), pd_df['Time']>60], ['Easy', 'Medium', 'Hard'], 'Unknown')
    return pd_df

def nested_np_where(pd_df):
    pd_df['difficulty'] = np.where(pd_df['Time'].between(0, 30, inclusive='neither'), 'Easy', np.where(pd_df['Time'].between(30, 60, inclusive='both'), 'Medium', np.where(pd_df['Time'] > 60, 'Hard', 'Unknown')))
    return pd_df

df = pd.DataFrame({'Time': np.random.default_rng().choice(120, size=15_000_000)-30})
%timeit loc(df.copy())
# 891 ms ± 6.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np_select(df.copy())
# 3.93 s ± 100 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit nested_np_where(df.copy())
# 4.82 s ± 1.05 s per loop (mean ± std. dev. of 7 runs, 10 loops each)