Pandas - Case when & default in pandas
Asked Answered
C

4

54

I have the below case statement in python,

pd_df['difficulty'] = 'Unknown'
pd_df['difficulty'][(pd_df['Time']<30) & (pd_df['Time']>0)] = 'Easy'
pd_df['difficulty'][(pd_df['Time']>=30) & (pd_df['Time']<=60)] = 'Medium'
pd_df['difficulty'][pd_df['Time']>60] = 'Hard'

But when I run the code, it throws an error.

A value is trying to be set on a copy of a slice from a DataFrame
Chromatogram answered 12/3, 2018 at 5:23 Comment(0)
K
109

Option 1
For performance, use a nested np.where condition. For the condition, you can just use pd.Series.between, and the default value will be inserted accordingly.

pd_df['difficulty'] = np.where(
     pd_df['Time'].between(0, 30, inclusive=False), 
    'Easy', 
     np.where(
        pd_df['Time'].between(0, 30, inclusive=False), 'Medium', 'Unknown'
     )
)

Option 2
Similarly, using np.select, this gives more room for adding conditions:

pd_df['difficulty'] = np.select(
    [
        pd_df['Time'].between(0, 30, inclusive=False), 
        pd_df['Time'].between(30, 60, inclusive=True)
    ], 
    [
        'Easy', 
        'Medium'
    ], 
    default='Unknown'
)

Option 3
Another performant solution involves loc:

pd_df['difficulty'] = 'Unknown'
pd_df.loc[pd_df['Time'].between(0, 30, inclusive=False), 'difficulty'] = 'Easy'
pd_df.loc[pd_df['Time'].between(30, 60, inclusive=True), 'difficulty'] = 'Medium'
Kinser answered 12/3, 2018 at 5:27 Comment(0)
T
9

case_when

Since pandas 2.2.0, you can use case_when() on a column. Just initialize with the default value and replace values in it using case_when(), which accepts a list of (condition, replacement) tuples. For the example in the OP, we can use the following.

pd_df["difficulty"] = "Unknown"
pd_df["difficulty"] = pd_df["difficulty"].case_when([
    (pd_df.eval("0 < Time < 30"), "Easy"), 
    (pd_df.eval("30 <= Time <= 60"), "Medium"), 
    (pd_df.eval("Time > 60"), "Hard")
])

loc

OP's code only needed loc to correctly call the __setitem__() method via []. In particular, they already have used the proper brackets () to evaluate &-chained conditions individually.

The basic idea of this approach is to initialize a column with some default value (e.g. "Unknown") and update rows depending on condtions (e.g. "Easy" if 0<Time<30), etc.

When I time the options given on this page, for large frames, loc approach is the fastest (4-5 times faster than np.select and nested np.where).1.

pd_df['difficulty'] = 'Unknown'
pd_df.loc[(pd_df['Time']<30) & (pd_df['Time']>0), 'difficulty'] = 'Easy'
pd_df.loc[(pd_df['Time']>=30) & (pd_df['Time']<=60), 'difficulty'] = 'Medium'
pd_df.loc[pd_df['Time']>60, 'difficulty'] = 'Hard'


1: Code used for benchmark.

def loc(pd_df):
    pd_df['difficulty'] = 'Unknown'
    pd_df.loc[(pd_df['Time']<30) & (pd_df['Time']>0), 'difficulty'] = 'Easy'
    pd_df.loc[(pd_df['Time']>=30) & (pd_df['Time']<=60), 'difficulty'] = 'Medium'
    pd_df.loc[pd_df['Time']>60, 'difficulty'] = 'Hard'
    return pd_df

def np_select(pd_df):
    pd_df['difficulty'] = np.select([pd_df['Time'].between(0, 30, inclusive='neither'), pd_df['Time'].between(30, 60, inclusive='both'), pd_df['Time']>60], ['Easy', 'Medium', 'Hard'], 'Unknown')
    return pd_df

def nested_np_where(pd_df):
    pd_df['difficulty'] = np.where(pd_df['Time'].between(0, 30, inclusive='neither'), 'Easy', np.where(pd_df['Time'].between(30, 60, inclusive='both'), 'Medium', np.where(pd_df['Time'] > 60, 'Hard', 'Unknown')))
    return pd_df


df = pd.DataFrame({'Time': np.random.default_rng().choice(120, size=15_000_000)-30})

%timeit loc(df.copy())
# 891 ms ± 6.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit np_select(df.copy())
# 3.93 s ± 100 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit nested_np_where(df.copy())
# 4.82 s ± 1.05 s per loop (mean ± std. dev. of 7 runs, 10 loops each)
Transported answered 6/9, 2022 at 9:37 Comment(0)
H
4

For the sake of readability "match-case" syntax is introduced since python 3.10.
https://peps.python.org/pep-0622/

import pandas as pd

# Sample data
data = {
    'Time': [15, 45, 70, -5, 0, 30, 60]
    }
pd_df = pd.DataFrame(data)

def classify_difficulty(time):
    match time:
        case _ if 0 < time < 30:
            return 'Easy'
        case _ if 30 <= time <= 60:
            return 'Medium'
        case _ if time > 60:
            return 'Hard'
        case _:
            return 'Unknown'

pd_df['difficulty'] = pd_df['Time'].apply(classify_difficulty)

print(pd_df)
   Time difficulty
0    15       Easy
1    45     Medium
2    70       Hard
3    -5    Unknown
4     0    Unknown
5    30     Medium
6    60     Medium
Heteropolar answered 9/8, 2023 at 9:37 Comment(0)
D
2
import numpy as np
import pandas as pd

def case_when(*args):
    return np.select(
        condlist = args[::2],
        choicelist = args[1::2],
        default=pd.NA
    )

df = pd.DataFrame({"cola":["a","b","a","a","c","d","d","e","c"],
                   "colb":range(9)})

df["newcol"] = case_when(df["cola"] == "a","ap",
                         df["colb"] == 0, "x", # Not taken because it's after the first line
                         df["colb"] == 1, "y",
                         True, df["cola"]
                         )

df["newcolb"] = case_when(df["cola"] == "e",1,
                          df["colb"] == 8, 2
                          )

df

#   cola  colb newcol newcolb
# 0    a     0     ap    <NA>
# 1    b     1      y    <NA>
# 2    a     2     ap    <NA>
# 3    a     3     ap    <NA>
# 4    c     4      c    <NA>
# 5    d     5      d    <NA>
# 6    d     6      d    <NA>
# 7    e     7      e       1
# 8    c     8      c       2

For the sake of readability, like in R.

Diffraction answered 6/10, 2022 at 12:11 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.