How can I apply a custom function that adds new columns to a dataframe to a subset of existing columns?
Asked Answered
H

7

17

I am working with a large dataset where much of the data was entered twice. This means that many of the variables are represented by pairs of columns: column.1 with the data entered by one person, and column.2 where the same data was entered by a different person. I want to create a "master" column called simply column that first draws from column.1 and then, if column.1 is NA, draws from column.2.

Here is an example of what I am trying to do with made-up data:

mydata <- data.frame(name = c("Sarah","Ella","Carmen","Dinah","Billie"),
                     cheese.1 = c(1,4,NA,6,NA),
                     cheese.2 = c(1,4,3,5,NA),
                     milk.1 = c(NA,2,0,4,NA),
                     milk.2 = c(1,2,1,4,2),
                     tofu.1 = c("yum","yum",NA,"gross", NA),
                     tofu.2 = c("gross", "yum", "yum", NA, "gross"))

For example, the code below shows an example of what I want to do for a single pair of columns.

mydata %>% mutate(cheese = ifelse(is.na(cheese.1), cheese.2, cheese.1))

#OUTPUT:

    name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2 cheese
1  Sarah        1        1     NA      1    yum  gross      1
2   Ella        4        4      2      2    yum    yum      4
3 Carmen       NA        3      0      1   <NA>    yum      3
4  Dinah        6        5      4      4  gross   <NA>      6
5 Billie       NA       NA     NA      2   <NA>  gross     NA

However, I want to automate the process rather than doing each manually. Below is my attempt at automating the process, using a list (col.list) of the column pairs for which I want to create new "master" columns:

col.list = c("cheese","milk","tofu")

lapply(col.list, FUN = function(x) {
  v <- as.name({{x}})
  v.1 <- as.name(paste0({{x}}, ".1"))
  v.2 <- as.name(paste0(({{x}}), ".2"))
  mydata %>% mutate(v = ifelse(is.na({{v.1}}), {{v.2}}, {{v.1}}))
})

#OUTPUT:

[[1]]
    name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2  v
1  Sarah        1        1     NA      1    yum  gross  1
2   Ella        4        4      2      2    yum    yum  4
3 Carmen       NA        3      0      1   <NA>    yum  3
4  Dinah        6        5      4      4  gross   <NA>  6
5 Billie       NA       NA     NA      2   <NA>  gross NA

[[2]]
    name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2 v
1  Sarah        1        1     NA      1    yum  gross 1
2   Ella        4        4      2      2    yum    yum 2
3 Carmen       NA        3      0      1   <NA>    yum 0
4  Dinah        6        5      4      4  gross   <NA> 4
5 Billie       NA       NA     NA      2   <NA>  gross 2

[[3]]
    name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2     v
1  Sarah        1        1     NA      1    yum  gross   yum
2   Ella        4        4      2      2    yum    yum   yum
3 Carmen       NA        3      0      1   <NA>    yum   yum
4  Dinah        6        5      4      4  gross   <NA> gross
5 Billie       NA       NA     NA      2   <NA>  gross gross

The problems with this attempt are:

  1. the new columns are not correctly named (they should be named cheese, milk and tofu rather than all be called v)
  2. the new columns are not added to the original data frame. What I want is for the program to add a series of new "master" columns to my dataframe (one new column for each pair of columns identified in col.list).
Hasheem answered 29/10, 2021 at 21:30 Comment(3)
I'm interested in a solution using coalesce which I couldn't use with "dynamic" column names. Perhaps someone reading this could give some insights?Ovida
@Martin Gal, please have a look at my solution. It‘s not with dynamic column names, but at least with coalesce.Townswoman
@MartinGal I added an option using exec and coalesce. It has dynamic names, but only with the help of str_glue().Danseuse
O
9

(1) You have to wrap v into the curly-curly operator and use :=:

library(dplyr)

col.list <- c("cheese","milk","tofu")

lapply(col.list, FUN = function(x) {
  v <- as.name({{x}})
  v.1 <- as.name(paste0({{x}}, ".1"))
  v.2 <- as.name(paste0(({{x}}), ".2"))
  mydata %>% mutate({{ v }} = ifelse(is.na({{v.1}}), {{v.2}}, {{v.1}}))
})

returns

[[1]]
    name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2 cheese
1  Sarah        1        1     NA      1    yum  gross      1
2   Ella        4        4      2      2    yum    yum      4
3 Carmen       NA        3      0      1   <NA>    yum      3
4  Dinah        6        5      4      4  gross   <NA>      6
5 Billie       NA       NA     NA      2   <NA>  gross     NA

[...]

which is one step closer to your desired output.

(2) But to get your desired output, I suggest using purrr:

library(purrr)
library(dplyr)

col.list %>% 
  map(~mydata %>% 
        select(name, starts_with(.x)) %>% 
        mutate({{ .x }} := ifelse(
          is.na(!!sym(paste0(.x, ".1"))), 
          !!sym(paste0(.x, ".2")), 
          !!sym(paste0(.x, ".1"))
          )
        )
  ) %>% 
  reduce(left_join, by = "name")

This returns

    name cheese.1 cheese.2 cheese milk.1 milk.2 milk tofu.1 tofu.2  tofu
1  Sarah        1        1      1     NA      1    1    yum  gross   yum
2   Ella        4        4      4      2      2    2    yum    yum   yum
3 Carmen       NA        3      3      0      1    0   <NA>    yum   yum
4  Dinah        6        5      6      4      4    4  gross   <NA> gross
5 Billie       NA       NA     NA     NA      2    2   <NA>  gross gross
Ovida answered 29/10, 2021 at 22:22 Comment(4)
Great answer. I tried to find an rlang answer but eventually gave up.Characterize
@Characterize Your answer is far more elegant and (I guess) faster.Ovida
Up for the challenge as always my friend :)Lipocaic
I only have one suggestion because long ago Anil told me and I tried it ever since. I would always prefer to not hard-code the solution because we don't know have many of these might be in the real data set.Lipocaic
D
7

Here is a pretty simple and dynamic option. Since it uses tidyselect, if there are more than just two columns (eg cheese.1, cheese.2, and cheese.3) this will still work. This will also work if the column groups are unbalanced (eg 3 cheese columns, but only 2 milk columns):

library(purrr)
library(stringr)
library(rlang)
library(dplyr)

col.list <- c("cheese","milk","tofu")

express <- map(set_names(col.list), ~ 
                 str_glue("coalesce(!!!across(starts_with(\"{.x}\")))") %>% 
                 parse_expr())

mydata %>%
  mutate(!!! express, .keep = "unused")

Output

The other columns were removed by .keep = "unused". If you want to keep all the columns then delete that argument.

    name cheese milk  tofu
1  Sarah      1    1   yum
2   Ella      4    2   yum
3 Carmen      3    0   yum
4  Dinah      6    4 gross
5 Billie     NA    2 gross

How it works

  1. The use of map and set_names is important because this creates a named list, which is important for the big-bang !!! operator later. map creates a named list of expressions.
  2. The use of across and coalesce allows the dynamic tidy-selection of columns.
  3. The !!! operator force-splices the list of objects and the names for the columns are from the list names set up using map and set_names.
Danseuse answered 29/10, 2021 at 23:1 Comment(1)
I like using glue package here :)Lipocaic
L
5

Here is another way of doing this amid all the great answers you got:

library(dplyr)
library(purrr)

col.list %>%
  reduce(~ .x %>% 
           bind_cols(mydata %>% 
                       select(starts_with(.y)) %>%
                       mutate(!!gsub("(\\D+)\\.\\d+", "\\1", .y) := invoke(coalesce, cur_data()))), 
         .init = NULL)

  cheese.1 cheese.2 cheese milk.1 milk.2 milk tofu.1 tofu.2  tofu
1        1        1      1     NA      1    1    yum  gross   yum
2        4        4      4      2      2    2    yum    yum   yum
3       NA        3      3      0      1    0   <NA>    yum   yum
4        6        5      6      4      4    4  gross   <NA> gross
5       NA       NA     NA     NA      2    2   <NA>  gross gross
Lipocaic answered 14/11, 2021 at 13:11 Comment(4)
I thought, := and !! are exported from rlang into dplyr. ;-)Ovida
Yes Martin you are right. I will edit the description :)Lipocaic
FYI, invoke is soft-depreciated in favor of exec.Danseuse
@Danseuse Yes exactly, I went a bit old school here :) Thanks for the upvote, I already did yours.Lipocaic
G
3

Here is one way I would do it. First convert to long format then reshape back to wide format but having only 2 value columns 1 and 2

library(dplyr)
library(tidyr)

mydata <- data.frame(name = c("Sarah","Ella","Carmen","Dinah","Billie"),
                     cheese.1 = c(1,4,NA,6,NA),
                     cheese.2 = c(1,4,3,5,NA),
                     milk.1 = c(NA,2,0,4,NA),
                     milk.2 = c(1,2,1,4,2),
                     tofu.1 = c("yum","yum",NA,"gross", NA),
                     tofu.2 = c("gross", "yum", "yum", NA, "gross"))
mydata_long <- mydata %>% 
  mutate(across(where(is.numeric), as.character)) %>% 
  pivot_longer(-name,
               names_to = c("food", "nr"),
               names_sep = "\\.")
mydata_long
#> # A tibble: 30 x 4
#>    name  food   nr    value
#>    <chr> <chr>  <chr> <chr>
#>  1 Sarah cheese 1     1    
#>  2 Sarah cheese 2     1    
#>  3 Sarah milk   1     <NA> 
#>  4 Sarah milk   2     1    
#>  5 Sarah tofu   1     yum  
#>  6 Sarah tofu   2     gross
#>  7 Ella  cheese 1     4    
#>  8 Ella  cheese 2     4    
#>  9 Ella  milk   1     2    
#> 10 Ella  milk   2     2    
#> # ... with 20 more rows

Apply ifelse() function after transforming back to different wide format

mydata_wide <- mydata_long %>% 
  pivot_wider(names_from = nr,
              values_from = value) %>% 
  mutate(final_val = ifelse(is.na(`1`), `2`, `1`)) %>% 
  arrange(food)
mydata_wide
#> # A tibble: 15 x 5
#>    name   food   `1`   `2`   final_val
#>    <chr>  <chr>  <chr> <chr> <chr>    
#>  1 Sarah  cheese 1     1     1        
#>  2 Ella   cheese 4     4     4        
#>  3 Carmen cheese <NA>  3     3        
#>  4 Dinah  cheese 6     5     6        
#>  5 Billie cheese <NA>  <NA>  <NA>     
#>  6 Sarah  milk   <NA>  1     1        
#>  7 Ella   milk   2     2     2        
#>  8 Carmen milk   0     1     0        
#>  9 Dinah  milk   4     4     4        
#> 10 Billie milk   <NA>  2     2        
#> 11 Sarah  tofu   yum   gross yum      
#> 12 Ella   tofu   yum   yum   yum      
#> 13 Carmen tofu   <NA>  yum   yum      
#> 14 Dinah  tofu   gross <NA>  gross    
#> 15 Billie tofu   <NA>  gross gross
mydata_wide2 <- mydata_wide %>% 
  pivot_wider(-c(`1`, `2`),
              names_from = food,
              values_from = final_val) 
mydata_wide2
#> # A tibble: 5 x 4
#>   name   cheese milk  tofu 
#>   <chr>  <chr>  <chr> <chr>
#> 1 Sarah  1      1     yum  
#> 2 Ella   4      2     yum  
#> 3 Carmen 3      0     yum  
#> 4 Dinah  6      4     gross
#> 5 Billie <NA>   2     gross

Created on 2021-10-29 by the reprex package (v2.0.1)

Gardia answered 29/10, 2021 at 22:18 Comment(0)
C
3

I would use purrr::map_dfc and coalesce here. Looks pretty straightforward.

library(purrr)
library(dplyr)
library(stringr)

mydata %>% mutate(map2_dfc(select(., ends_with('1')),
                           select(., ends_with('2')),
                           ~coalesce(.x, .y)))%>%
  select(-ends_with('2'))%>%
  rename_with(~str_remove(.x, '\\.\\d+$'))

    name cheese milk  tofu
1  Sarah      1    1   yum
2   Ella      4    2   yum
3 Carmen      3    0   yum
4  Dinah      6    4 gross
5 Billie     NA    2 gross
Characterize answered 29/10, 2021 at 22:32 Comment(1)
Great answer my friend :)Lipocaic
A
3

Here is how you can achieve your task:

  1. define your pairs (in case you have hundreds of columns, this could be automated.
  2. use imap_dfc to apply coalesce do the defined pairs
  3. bind to original dataframe
library(dplyr)
library(purrr)

pairs <- list(cheese = c(2, 3), milk = c(4, 5), tofu = c(6, 7))

imap_dfc(pairs, ~mydata[, .x] %>% transmute(!!.y := coalesce(!!!syms(names(mydata)[.x])))) %>% 
  bind_cols(mydata)
  cheese milk  tofu   name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2
1      1    1   yum  Sarah        1        1     NA      1    yum  gross
2      4    2   yum   Ella        4        4      2      2    yum    yum
3      3    0   yum Carmen       NA        3      0      1   <NA>    yum
4      6    4 gross  Dinah        6        5      4      4  gross   <NA>
5     NA    2 gross Billie       NA       NA     NA      2   <NA>  gross
Arrange answered 29/10, 2021 at 23:5 Comment(3)
Lot's of nice ways to solve this one.Ovida
In this case your answer is best. And I think Guedes already answered. So this time you can relax and enjoy!Arrange
This is also great my friend:)Lipocaic
T
1

Another tidyverse option. Advantage here is that it keeps the original data type and doesn‘t convert everything to character values.

library(tidyverse)
mydata %>%
  pivot_longer(cols = -name,
               names_pattern = '(.*)(\\..)',
               names_to = c('.value', 'number')) %>%
  group_by(name) %>%
  mutate(across(-number, ~if_else(is.na(.[1]), .[2], .[1]))) %>%
  ungroup() %>%
  filter(number == '.1') %>%
  select(-number)

Which gives

# A tibble: 5 x 4
  name   cheese  milk tofu 
  <chr>   <dbl> <dbl> <chr>
1 Sarah       1     1 yum  
2 Ella        4     2 yum  
3 Carmen      3     0 yum  
4 Dinah       6     4 gross
5 Billie     NA     2 gross

Alternative solution with coalesce:

mydata %>%
  pivot_longer(cols = -name,
               names_pattern = '(.*)(\\..)',
               names_to = c('.value', 'number')) %>%
  group_by(name) %>%
  mutate(across(-number, ~coalesce(.[1], .[2]))) %>%
  ungroup() %>%
  filter(number == '.1') %>%
  select(-number)
Townswoman answered 29/10, 2021 at 22:34 Comment(1)
Yeah I thought of something like this first but wanted some dynamic columns in the coalesce function. I guess @Characterize showed a proper solution for that one. :-)Ovida

© 2022 - 2024 — McMap. All rights reserved.