case_when function from R to Python

6

22

How can I implement R's case_when function in Python?

Here is the case_when function of R:

https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/case_when

As a minimal working example, suppose we have the following dataframe (Python code follows):

import pandas as pd
import numpy as np

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'age': [42, 52, 36, 24, 73], 
        'preTestScore': [4, 24, 31, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns = ['name', 'age', 'preTestScore', 'postTestScore'])
df

Suppose then that we want to create a new column called 'elderly' that looks at the 'age' column and does the following:

if age < 10 then baby
if age >= 10 and age < 20 then kid
if age >= 20 and age < 30 then young
if age >= 30 and age < 50 then mature
if age >= 50 then grandpa

Can someone help with this?

Anjaanjali answered 12/2, 2019 at 15:20 Comment(0)
43

You want to use np.select:

conditions = [
    (df["age"].lt(10)),
    (df["age"].ge(10) & df["age"].lt(20)),
    (df["age"].ge(20) & df["age"].lt(30)),
    (df["age"].ge(30) & df["age"].lt(50)),
    (df["age"].ge(50)),
]
choices = ["baby", "kid", "young", "mature", "grandpa"]

df["elderly"] = np.select(conditions, choices)

# Results in:
#      name  age  preTestScore  postTestScore  elderly
#  0  Jason   42             4             25   mature
#  1  Molly   52            24             94  grandpa
#  2   Tina   36            31             57   mature
#  3   Jake   24             2             62    young
#  4    Amy   73             3             70  grandpa

The conditions and choices lists must be the same length.
There is also a default parameter that is used when all conditions evaluate to False.
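
As a minimal sketch of the default parameter, the last range can be dropped from both lists and handled by the fallback instead. The ages below are made up for illustration and are not the question's data. Passing a string default also keeps every output the same type as the choices; newer NumPy versions may reject mixing string choices with the integer default of 0.

```python
import numpy as np
import pandas as pd

# Made-up ages chosen to hit every range, including the boundaries.
df = pd.DataFrame({"age": [5, 10, 24, 42, 50, 73]})

# Only the first four ranges are spelled out; "grandpa" is the fallback.
conditions = [
    df["age"].lt(10),
    df["age"].ge(10) & df["age"].lt(20),
    df["age"].ge(20) & df["age"].lt(30),
    df["age"].ge(30) & df["age"].lt(50),
]
choices = ["baby", "kid", "young", "mature"]

# default is used for every row where all conditions are False.
df["elderly"] = np.select(conditions, choices, default="grandpa")
print(df)
```

Here the rows with ages 50 and 73 match none of the conditions, so they receive the default.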

Ban answered 12/2, 2019 at 15:25 Comment(4)
Thank you. What if the conditions compare against strings instead of numerical values? For example, if age == "forty" then 'mature'. How could your code be modified? - Anjaanjali
Then I'd use df["age"].eq("forty"), as this allows us to continue using the bitwise &. - Ban
How can the code be modified to put all other cases in one condition? For example, the 'greater than 50' condition is essentially redundant, so maybe it is better practice to make the last condition 'anything else, then grandpa'. - Anjaanjali
There is a default parameter on np.select, which is used when all conditions evaluate to False: np.select(conditions, choices, default='Grandpa'). You could then remove the last elements from both the conditions and choices lists. - Ban
12

np.select is great because it's a general way to pick values from choicelist depending on conditions.

However, for the particular problem the OP is trying to solve, pandas' cut function is a more succinct way to achieve the same result.


bin_cond = [-np.inf, 10, 20, 30, 50, np.inf]            # think of them as bin edges
bin_lab = ["baby", "kid", "young", "mature", "grandpa"] # the length needs to be len(bin_cond) - 1
# right=False makes each bin include its left edge and exclude its right edge, e.g. [10, 20)
df["elderly2"] = pd.cut(df["age"], bins=bin_cond, labels=bin_lab, right=False)

#     name  age  preTestScore  postTestScore  elderly elderly2
# 0  Jason   42             4             25   mature   mature
# 1  Molly   52            24             94  grandpa  grandpa
# 2   Tina   36            31             57   mature   mature
# 3   Jake   24             2             62    young    young
# 4    Amy   73             3             70  grandpa  grandpa
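
One detail worth checking with pd.cut is which side of each bin is closed, because that decides where an age sitting exactly on an edge (10, 20, 30, 50) ends up. By default the bins are closed on the right; right=False closes them on the left, which matches the "age >= 10 and age < 20" style of rule in the question. A small sketch with made-up edge-case ages:

```python
import numpy as np
import pandas as pd

# Made-up ages that sit on or right next to the bin edges.
ages = pd.Series([9, 10, 19, 20, 49, 50])
edges = [-np.inf, 10, 20, 30, 50, np.inf]
labels = ["baby", "kid", "young", "mature", "grandpa"]

# Default (right=True): intervals look like (10, 20], so 10 is still "baby".
right_closed = pd.cut(ages, bins=edges, labels=labels)

# right=False: intervals look like [10, 20), so 10 is already "kid".
left_closed = pd.cut(ages, bins=edges, labels=labels, right=False)

print(pd.DataFrame({"age": ages,
                    "right_closed": right_closed,
                    "left_closed": left_closed}))
```

None of the five ages in the question's dataframe sits on an edge, so both variants give the same labels there; the difference only shows up for values such as 10 or 50.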
Benefaction answered 6/1, 2021 at 5:28 Comment(0)
3

pyjanitor has a case_when implementation in dev that could be helpful in this case. The idea is inspired by if_else in pydatatable and fcase in R's data.table; under the hood, it uses pd.Series.mask:

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor as jn

df.case_when(
    df.age.lt(10), 'baby',                      # 1st condition, result
    df.age.between(10, 20, 'left'), 'kid',      # 2nd condition, result
    df.age.between(20, 30, 'left'), 'young',    # 3rd condition, result
    df.age.between(30, 50, 'left'), 'mature',   # 4th condition, result
    'grandpa',                                  # default if none of the conditions match
    column_name='elderly',                      # column name to assign to
)
 
    name  age  preTestScore  postTestScore  elderly
0  Jason   42             4             25   mature
1  Molly   52            24             94  grandpa
2   Tina   36            31             57   mature
3   Jake   24             2             62    young
4    Amy   73             3             70  grandpa

Alby's solution is more efficient for this use case than an if/else construct.

Extortionary answered 30/9, 2021 at 22:25 Comment(0)
3

pandas 2.2 added a Series.case_when method.

caselist=[(df.age.lt(10), 'baby'), 
          (df.age.ge(10) & df.age.lt(20), 'kid'), 
          (df.age.ge(20) & df.age.lt(30), 'young'), 
          (df.age.ge(30) & df.age.lt(50), 'mature'), 
          (df.age.ge(50), 'grandpa')]

df.assign(elderly = df.age.case_when(caselist=caselist))
    name  age  preTestScore  postTestScore  elderly
0  Jason   42             4             25   mature
1  Molly   52            24             94  grandpa
2   Tina   36            31             57   mature
3   Jake   24             2             62    young
4    Amy   73             3             70  grandpa
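
Series.case_when keeps the calling Series' own value wherever no condition is True, so one way to get a fallback label is to call it on a Series that already holds that label. The sketch below assumes that behaviour and uses made-up ages; the `fallback` name is only for illustration. For pandas versions older than 2.2 it applies the same condition/label pairs with Series.mask instead.

```python
import pandas as pd

# Made-up ages chosen to hit every range, including the boundaries.
df = pd.DataFrame({"age": [5, 10, 24, 42, 50, 73]})

# Start from a Series that already holds the fallback label ...
fallback = pd.Series("grandpa", index=df.index)

# ... and list only the ranges that should overwrite it.
caselist = [
    (df["age"].lt(10), "baby"),
    (df["age"].ge(10) & df["age"].lt(20), "kid"),
    (df["age"].ge(20) & df["age"].lt(30), "young"),
    (df["age"].ge(30) & df["age"].lt(50), "mature"),
]

if hasattr(pd.Series, "case_when"):  # pandas 2.2 or newer
    df["elderly"] = fallback.case_when(caselist)
else:
    # Older pandas: apply the same pairs one by one with Series.mask.
    result = fallback
    for condition, label in reversed(caselist):
        result = result.mask(condition, label)
    df["elderly"] = result

print(df)
```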
Burgh answered 25/1 at 11:50 Comment(0)
1

Just for future reference: nowadays you can use pandas cut or map with moderate to good speed. If you need something faster this might not suit your needs, but it is good enough for daily use and batch jobs.

import pandas as pd

If you choose map or apply, write a function that encodes your ranges and returns the label for the range the value falls in:

def calc_grade(age):
    # Ages of 200 and above match no branch and return None.
    if 50 <= age < 200:
        return 'Grandpa'
    elif 30 <= age < 50:
        return 'Mature'
    elif 20 <= age < 30:
        return 'Young'
    elif 10 <= age < 20:
        return 'Kid'
    elif age < 10:
        return 'Baby'

%timeit df['elderly'] = df['age'].map(calc_grade)
    name  age  preTestScore  postTestScore  elderly
0  Jason   42             4             25   Mature
1  Molly   52            24             94  Grandpa
2   Tina   36            31             57   Mature
3   Jake   24             2             62    Young
4    Amy   73             3             70  Grandpa

393 µs ± 8.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


If you choose cut, there are many options. One approach is to include the left edge of each bin and exclude the right edge, with one label per bin.

bins = [0, 10, 20, 30, 50, 200]  # 200-year-old vampires are people too, I guess... change the upper bound to an age you find plausible.
labels = ['Baby','Kid','Young', 'Mature','Grandpa']

%timeit df['elderly'] = pd.cut(x=df.age, bins=bins, labels=labels , include_lowest=True, right=False, ordered=False)
    name  age  preTestScore  postTestScore  elderly
0  Jason   42             4             25   Mature
1  Molly   52            24             94  Grandpa
2   Tina   36            31             57   Mature
3   Jake   24             2             62    Young
4    Amy   73             3             70  Grandpa
Calan answered 5/7, 2022 at 21:12 Comment(0)
0

Instead of numpy, you can create a function and use map or apply with a lambda:

def elderly_function(age):
    # Checks run from youngest to oldest, so each test only needs an upper bound.
    if age < 10:
        return 'baby'
    if age < 20:
        return 'kid'
    if age < 30:
        return 'young'
    if age < 50:
        return 'mature'
    if age >= 50:
        return 'grandpa'

df["elderly"] = df["age"].map(lambda x: elderly_function(x))
# Works with apply as well:
df["elderly"] = df["age"].apply(lambda x: elderly_function(x))

The numpy solution is probably faster and might be preferable if your df is considerably large.
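
If you want to check the speed difference on your own machine, a rough sketch like the following times both approaches on a larger, randomly generated column. The data, the function names and the column size are made up for illustration, and the timings will vary with hardware and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Made-up data: 100,000 random ages between 0 and 99.
rng = np.random.default_rng(0)
ages = pd.Series(rng.integers(0, 100, size=100_000))


def label_age(age):
    """Plain-Python version of the age rules from the question."""
    if age < 10:
        return "baby"
    if age < 20:
        return "kid"
    if age < 30:
        return "young"
    if age < 50:
        return "mature"
    return "grandpa"


def with_map():
    # Calls label_age once per row.
    return ages.map(label_age)


def with_select():
    # Evaluates each condition on the whole column at once.
    conditions = [
        ages.lt(10),
        ages.ge(10) & ages.lt(20),
        ages.ge(20) & ages.lt(30),
        ages.ge(30) & ages.lt(50),
    ]
    choices = ["baby", "kid", "young", "mature"]
    labels = np.select(conditions, choices, default="grandpa")
    return pd.Series(labels, index=ages.index)


print("map:      ", timeit.timeit(with_map, number=5))
print("np.select:", timeit.timeit(with_select, number=5))
```

Both functions produce the same labels, so the comparison is only about how long each takes.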

Coccus answered 11/6, 2022 at 14:3 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.