Equivalent Python code for mutate_if from tidyverse

I'm an avid R user and am learning Python along the way. One piece of example code that I can easily run in R is perplexing me in Python.

Here's the original data (constructed within R):

library(tidyverse)


df <- tribble(~name, ~age, ~gender, ~height_in,
        "john",20,'m',66,
        'mary',NA,'f',62,
        NA,38,'f',68,
        'larry',NA,NA,NA
)

The output of this looks like this:

df

# A tibble: 4 x 4
  name    age gender height_in
  <chr> <dbl> <chr>      <dbl>
1 john     20 m             66
2 mary     NA f             62
3 NA       38 f             68
4 larry    NA NA            NA

I want to do 3 things:

  1. I want to replace the NA values in columns that are characters with the value "zz"
  2. I want to replace the NA values in columns that are numeric with the value 0
  3. I want to convert the character columns to factors.

Here's how I did it in R (again, using the tidyverse package):

tmp <- df %>%
  mutate_if(is.character, function(x) ifelse(is.na(x),"zz",x)) %>%
  mutate_if(is.character, as.factor) %>%
  mutate_if(is.numeric, function(x) ifelse(is.na(x), 0, x))

Here's the output of the dataframe tmp:

tmp

# A tibble: 4 x 4
  name    age gender height_in
  <fct> <dbl> <fct>      <dbl>
1 john     20 m             66
2 mary      0 f             62
3 zz       38 f             68
4 larry     0 zz             0

I'm familiar with if and else statements in Python. What I don't know is the correct and most readable way to write the above in Python. I'm guessing that the pandas package has no mutate_if equivalent. What similar code can I use in Python to mimic the mutate_if, is.character, is.numeric, and as.factor functions found in tidyverse and R?

On a side note, I care less about execution speed than about readability, which is why I really enjoy tidyverse. I would be grateful for any tips or suggestions.

Edit 1: adding code to create a pandas dataframe

Here is the code I used to create the dataframe within Python. This may assist others in getting started.

import pandas as pd
import numpy as np

my_dict = {
    'name' : ['john','mary', np.nan, 'larry'],
    'age' : [20, np.nan, 38,  np.nan],
    'gender' : ['m','f','f', np.nan],
    'height_in' : [66, 62, 68, np.nan]
}

df = pd.DataFrame(my_dict)

The output of this should be similar:

print(df)
    name   age gender  height_in
0   john  20.0      m       66.0
1   mary   NaN      f       62.0
2    NaN  38.0      f       68.0
3  larry   NaN    NaN        NaN
Alms answered 26/2, 2019 at 18:31 Comment(0)

Well, after some sleep, I think I have it figured out.

Here's the code I used to take the pandas dataframe and apply the comparable mutate_if functions I mentioned earlier to get the same results.

# fill in the missing values (similar to mutate_if from tidyverse)
df1 = df.select_dtypes(include=['double']).fillna(0)
df2 = df.select_dtypes(include=['object']).fillna('zz').astype('category')

df = pd.concat([df2.reset_index(drop = True), df1], axis = 1)

print(df)
    name gender   age  height_in
0   john      m  20.0       66.0
1   mary      f   0.0       62.0
2     zz      f  38.0       68.0
3  larry     zz   0.0        0.0

# check again for the data types
df.dtypes
name         category
gender       category
age           float64
height_in     float64
dtype: object

The catch is that I had to 'break' apart the original dataframe, apply the changes (i.e., fill in the missing values and change data types), and then recombine the columns (i.e., put the data frame back together).
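
The split-and-recombine step can be avoided by selecting column labels by dtype and assigning back into a copy, which also keeps the original column order. The following is only a sketch, not a definitive approach: the names text_cols, num_cols, and out are illustrative, and it assumes every non-numeric column in the frame holds text (true for this example, not in general).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['john', 'mary', np.nan, 'larry'],
    'age': [20, np.nan, 38, np.nan],
    'gender': ['m', 'f', 'f', np.nan],
    'height_in': [66, 62, 68, np.nan],
})

# Pick column labels by dtype, roughly what is.numeric / is.character do in R.
# Assumption: every non-numeric column here is a text column.
num_cols = df.select_dtypes(include='number').columns
text_cols = df.select_dtypes(exclude='number').columns

out = df.copy()                                   # leave the original untouched
out[text_cols] = out[text_cols].fillna('zz')      # fill missing text values
out[num_cols] = out[num_cols].fillna(0)           # fill missing numbers
out = out.astype({col: 'category' for col in text_cols})  # like as.factor

print(out)
print(out.dtypes)
```

Because the columns are updated in place on the copy, no concat or reset_index is needed.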

Alms answered 27/2, 2019 at 16:56 Comment(0)

An attempt at an old question: a combination of replace (for the string columns) and fillna (for the numeric columns) could suffice here:

df.replace({None:'zz'}).fillna(0, downcast='infer') 
     name  age gender  height_in
0   john   20      m         66
1   mary    0      f         62
2     zz   38      f         68
3  larry    0     zz          0

To convert name to a categorical dtype, assign or pyjanitor's encode_categorical are possible options:

(df.replace({None:'zz'})
   .fillna(0, downcast='infer')
   # select the name column before casting it to category
   .assign(name = lambda df: df.name.astype('category'))
)

With pyjanitor:

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor
(df.replace({None:'zz'})
   .fillna(0, downcast='infer') 
   .encode_categorical('name')
)
Severally answered 14/11, 2021 at 3:54 Comment(0)

What about an approach that mirrors the tidyverse style:

>>> from datar import f
>>> from datar.tibble import tribble
>>> from datar.base import NA, is_na, is_numeric, is_character, as_factor
>>> from datar.dplyr import mutate, across, where
>>> from datar.tidyr import replace_na
>>> # or if you are lazy
>>> # from datar.all import *
>>> 
>>> df = tribble(
...     f.name, f.age, f.gender, f.height_in,
...     "john", 20,    'm',      66,
...     'mary', NA,    'f',      62,
...     NA,     38,    'f',      68,
...     'larry',NA,    NA,       NA
... )
>>> 
>>> tmp = df >> \
...   mutate(across(where(is_character), replace_na, "zz")) >> \
...   mutate(across(where(is_character), as_factor)) >> \
...   mutate(across(where(is_numeric), replace_na, 0))
>>> 
>>> tmp
        name       age     gender  height_in
  <category> <float64> <category>  <float64>
0       john      20.0          m       66.0
1       mary       0.0          f       62.0
2         zz      38.0          f       68.0
3      larry       0.0         zz        0.0

I am the author of the datar package. Please feel free to submit issues if you have any questions about using it.

Unwished answered 28/4, 2021 at 23:24 Comment(0)
