How can I apply a custom function that adds new columns to a dataframe to a subset of existing columns?

Asked 29/10, 2021 at 21:30 Answered 14/11, 2021 at 13:11

I am working with a large dataset where much of the data was entered twice. This means that many of the variables are represented by pairs of columns: column.1 with the data entered by one person, and column.2 where the same data was entered by a different person. I want to create a "master" column called simply column that first draws from column.1 and then, if column.1 is NA, draws from column.2.

Here is an example of what I am trying to do with made-up data:

mydata <- data.frame(name = c("Sarah","Ella","Carmen","Dinah","Billie"),
                     cheese.1 = c(1,4,NA,6,NA),
                     cheese.2 = c(1,4,3,5,NA),
                     milk.1 = c(NA,2,0,4,NA),
                     milk.2 = c(1,2,1,4,2),
                     tofu.1 = c("yum","yum",NA,"gross", NA),
                     tofu.2 = c("gross", "yum", "yum", NA, "gross"))

For example, the code below shows an example of what I want to do for a single pair of columns.

mydata %>% mutate(cheese = ifelse(is.na(cheese.1), cheese.2, cheese.1))

#OUTPUT:

    name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2 cheese
1  Sarah        1        1     NA      1    yum  gross      1
2   Ella        4        4      2      2    yum    yum      4
3 Carmen       NA        3      0      1   <NA>    yum      3
4  Dinah        6        5      4      4  gross   <NA>      6
5 Billie       NA       NA     NA      2   <NA>  gross     NA

However, I want to automate the process rather than doing each manually. Below is my attempt at automating the process, using a list (col.list) of the column pairs for which I want to create new "master" columns:

col.list = c("cheese","milk","tofu")

lapply(col.list, FUN = function(x) {
  v <- as.name({{x}})
  v.1 <- as.name(paste0({{x}}, ".1"))
  v.2 <- as.name(paste0(({{x}}), ".2"))
  mydata %>% mutate(v = ifelse(is.na({{v.1}}), {{v.2}}, {{v.1}}))
})

#OUTPUT:

[[1]]
    name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2  v
1  Sarah        1        1     NA      1    yum  gross  1
2   Ella        4        4      2      2    yum    yum  4
3 Carmen       NA        3      0      1   <NA>    yum  3
4  Dinah        6        5      4      4  gross   <NA>  6
5 Billie       NA       NA     NA      2   <NA>  gross NA

[[2]]
    name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2 v
1  Sarah        1        1     NA      1    yum  gross 1
2   Ella        4        4      2      2    yum    yum 2
3 Carmen       NA        3      0      1   <NA>    yum 0
4  Dinah        6        5      4      4  gross   <NA> 4
5 Billie       NA       NA     NA      2   <NA>  gross 2

[[3]]
    name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2     v
1  Sarah        1        1     NA      1    yum  gross   yum
2   Ella        4        4      2      2    yum    yum   yum
3 Carmen       NA        3      0      1   <NA>    yum   yum
4  Dinah        6        5      4      4  gross   <NA> gross
5 Billie       NA       NA     NA      2   <NA>  gross gross

The problems with this attempt are:

the new columns are not correctly named (they should be named cheese, milk and tofu rather than all be called v)
the new columns are not added to the original data frame. What I want is for the program to add a series of new "master" columns to my dataframe (one new column for each pair of columns identified in col.list).

Hasheem answered 29/10, 2021 at 21:30 Comment(3)

I'm interested in a solution using coalesce which I couldn't use with "dynamic" column names. Perhaps someone reading this could give some insights? – Ovida 29/10, 2021 at 22:24

@Martin Gal, please have a look at my solution. It‘s not with dynamic column names, but at least with coalesce. – Townswoman 29/10, 2021 at 22:41

@MartinGal I added an option using exec and coalesce. It has dynamic names, but only with the help of str_glue(). – Danseuse 29/10, 2021 at 23:10

(1) You have to wrap v into the curly-curly operator and use :=:

library(dplyr)

col.list <- c("cheese","milk","tofu")

lapply(col.list, FUN = function(x) {
  v <- as.name({{x}})
  v.1 <- as.name(paste0({{x}}, ".1"))
  v.2 <- as.name(paste0(({{x}}), ".2"))
  mydata %>% mutate({{ v }} = ifelse(is.na({{v.1}}), {{v.2}}, {{v.1}}))
})

returns

[[1]]
    name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2 cheese
1  Sarah        1        1     NA      1    yum  gross      1
2   Ella        4        4      2      2    yum    yum      4
3 Carmen       NA        3      0      1   <NA>    yum      3
4  Dinah        6        5      4      4  gross   <NA>      6
5 Billie       NA       NA     NA      2   <NA>  gross     NA

[...]

which is one step closer to your desired output.

(2) But to get your desired output, I suggest using purrr:

library(purrr)
library(dplyr)

col.list %>% 
  map(~mydata %>% 
        select(name, starts_with(.x)) %>% 
        mutate({{ .x }} := ifelse(
          is.na(!!sym(paste0(.x, ".1"))), 
          !!sym(paste0(.x, ".2")), 
          !!sym(paste0(.x, ".1"))
          )
        )
  ) %>% 
  reduce(left_join, by = "name")

This returns

    name cheese.1 cheese.2 cheese milk.1 milk.2 milk tofu.1 tofu.2  tofu
1  Sarah        1        1      1     NA      1    1    yum  gross   yum
2   Ella        4        4      4      2      2    2    yum    yum   yum
3 Carmen       NA        3      3      0      1    0   <NA>    yum   yum
4  Dinah        6        5      6      4      4    4  gross   <NA> gross
5 Billie       NA       NA     NA     NA      2    2   <NA>  gross gross

Ovida answered 29/10, 2021 at 22:22 Comment(4)

Great answer. I tried to find an rlang answer but eventually gave up. – Characterize 29/10, 2021 at 22:37

@Characterize Your answer is far more elegant and (I guess) faster. – Ovida 29/10, 2021 at 22:43

Up for the challenge as always my friend :) – Lipocaic 14/11, 2021 at 13:12

I only have one suggestion because long ago Anil told me and I tried it ever since. I would always prefer to not hard-code the solution because we don't know have many of these might be in the real data set. – Lipocaic 14/11, 2021 at 16:39

Here is a pretty simple and dynamic option. Since it uses tidyselect, if there are more than just two columns (eg cheese.1, cheese.2, and cheese.3) this will still work. This will also work if the column groups are unbalanced (eg 3 cheese columns, but only 2 milk columns):

library(purrr)
library(stringr)
library(rlang)
library(dplyr)

col.list <- c("cheese","milk","tofu")

express <- map(set_names(col.list), ~ 
                 str_glue("coalesce(!!!across(starts_with(\"{.x}\")))") %>% 
                 parse_expr())

mydata %>%
  mutate(!!! express, .keep = "unused")

Output

The other columns were removed by .keep = "unused". If you want to keep all the columns then delete that argument.

    name cheese milk  tofu
1  Sarah      1    1   yum
2   Ella      4    2   yum
3 Carmen      3    0   yum
4  Dinah      6    4 gross
5 Billie     NA    2 gross

How it works

The use of map and set_names is important because this creates a named list, which is important for the big-bang !!! operator later. map creates a named list of expressions.
The use of across and coalesce allows the dynamic tidy-selection of columns.
The !!! operator force-splices the list of objects and the names for the columns are from the list names set up using map and set_names.

Danseuse answered 29/10, 2021 at 23:1 Comment(1)

I like using glue package here :) – Lipocaic 14/11, 2021 at 13:12

Here is another way of doing this amid all the great answers you got:

library(dplyr)
library(purrr)

col.list %>%
  reduce(~ .x %>% 
           bind_cols(mydata %>% 
                       select(starts_with(.y)) %>%
                       mutate(!!gsub("(\\D+)\\.\\d+", "\\1", .y) := invoke(coalesce, cur_data()))), 
         .init = NULL)

  cheese.1 cheese.2 cheese milk.1 milk.2 milk tofu.1 tofu.2  tofu
1        1        1      1     NA      1    1    yum  gross   yum
2        4        4      4      2      2    2    yum    yum   yum
3       NA        3      3      0      1    0   <NA>    yum   yum
4        6        5      6      4      4    4  gross   <NA> gross
5       NA       NA     NA     NA      2    2   <NA>  gross gross

Lipocaic answered 14/11, 2021 at 13:11 Comment(4)

I thought, := and !! are exported from rlang into dplyr. ;-) – Ovida 14/11, 2021 at 14:10

Yes Martin you are right. I will edit the description :) – Lipocaic 14/11, 2021 at 16:37

FYI, invoke is soft-depreciated in favor of exec. – Danseuse 18/11, 2021 at 20:13

@Danseuse Yes exactly, I went a bit old school here :) Thanks for the upvote, I already did yours. – Lipocaic 18/11, 2021 at 20:17

Here is one way I would do it. First convert to long format then reshape back to wide format but having only 2 value columns 1 and 2

library(dplyr)
library(tidyr)

mydata <- data.frame(name = c("Sarah","Ella","Carmen","Dinah","Billie"),
                     cheese.1 = c(1,4,NA,6,NA),
                     cheese.2 = c(1,4,3,5,NA),
                     milk.1 = c(NA,2,0,4,NA),
                     milk.2 = c(1,2,1,4,2),
                     tofu.1 = c("yum","yum",NA,"gross", NA),
                     tofu.2 = c("gross", "yum", "yum", NA, "gross"))
mydata_long <- mydata %>% 
  mutate(across(where(is.numeric), as.character)) %>% 
  pivot_longer(-name,
               names_to = c("food", "nr"),
               names_sep = "\\.")
mydata_long
#> # A tibble: 30 x 4
#>    name  food   nr    value
#>    <chr> <chr>  <chr> <chr>
#>  1 Sarah cheese 1     1    
#>  2 Sarah cheese 2     1    
#>  3 Sarah milk   1     <NA> 
#>  4 Sarah milk   2     1    
#>  5 Sarah tofu   1     yum  
#>  6 Sarah tofu   2     gross
#>  7 Ella  cheese 1     4    
#>  8 Ella  cheese 2     4    
#>  9 Ella  milk   1     2    
#> 10 Ella  milk   2     2    
#> # ... with 20 more rows

Apply ifelse() function after transforming back to different wide format

mydata_wide <- mydata_long %>% 
  pivot_wider(names_from = nr,
              values_from = value) %>% 
  mutate(final_val = ifelse(is.na(`1`), `2`, `1`)) %>% 
  arrange(food)
mydata_wide
#> # A tibble: 15 x 5
#>    name   food   `1`   `2`   final_val
#>    <chr>  <chr>  <chr> <chr> <chr>    
#>  1 Sarah  cheese 1     1     1        
#>  2 Ella   cheese 4     4     4        
#>  3 Carmen cheese <NA>  3     3        
#>  4 Dinah  cheese 6     5     6        
#>  5 Billie cheese <NA>  <NA>  <NA>     
#>  6 Sarah  milk   <NA>  1     1        
#>  7 Ella   milk   2     2     2        
#>  8 Carmen milk   0     1     0        
#>  9 Dinah  milk   4     4     4        
#> 10 Billie milk   <NA>  2     2        
#> 11 Sarah  tofu   yum   gross yum      
#> 12 Ella   tofu   yum   yum   yum      
#> 13 Carmen tofu   <NA>  yum   yum      
#> 14 Dinah  tofu   gross <NA>  gross    
#> 15 Billie tofu   <NA>  gross gross

mydata_wide2 <- mydata_wide %>% 
  pivot_wider(-c(`1`, `2`),
              names_from = food,
              values_from = final_val) 
mydata_wide2
#> # A tibble: 5 x 4
#>   name   cheese milk  tofu 
#>   <chr>  <chr>  <chr> <chr>
#> 1 Sarah  1      1     yum  
#> 2 Ella   4      2     yum  
#> 3 Carmen 3      0     yum  
#> 4 Dinah  6      4     gross
#> 5 Billie <NA>   2     gross

^{Created on 2021-10-29 by the reprex package (v2.0.1)}

Gardia answered 29/10, 2021 at 22:18 Comment(0)

I would use purrr::map_dfc and coalesce here. Looks pretty straightforward.

library(purrr)
library(dplyr)
library(stringr)

mydata %>% mutate(map2_dfc(select(., ends_with('1')),
                           select(., ends_with('2')),
                           ~coalesce(.x, .y)))%>%
  select(-ends_with('2'))%>%
  rename_with(~str_remove(.x, '\\.\\d+$'))

    name cheese milk  tofu
1  Sarah      1    1   yum
2   Ella      4    2   yum
3 Carmen      3    0   yum
4  Dinah      6    4 gross
5 Billie     NA    2 gross

Characterize answered 29/10, 2021 at 22:32 Comment(1)

Great answer my friend :) – Lipocaic 14/11, 2021 at 13:11

Here is how you can achieve your task:

define your pairs (in case you have hundreds of columns, this could be automated.
use imap_dfc to apply coalesce do the defined pairs
bind to original dataframe

library(dplyr)
library(purrr)

pairs <- list(cheese = c(2, 3), milk = c(4, 5), tofu = c(6, 7))

imap_dfc(pairs, ~mydata[, .x] %>% transmute(!!.y := coalesce(!!!syms(names(mydata)[.x])))) %>% 
  bind_cols(mydata)

  cheese milk  tofu   name cheese.1 cheese.2 milk.1 milk.2 tofu.1 tofu.2
1      1    1   yum  Sarah        1        1     NA      1    yum  gross
2      4    2   yum   Ella        4        4      2      2    yum    yum
3      3    0   yum Carmen       NA        3      0      1   <NA>    yum
4      6    4 gross  Dinah        6        5      4      4  gross   <NA>
5     NA    2 gross Billie       NA       NA     NA      2   <NA>  gross

Arrange answered 29/10, 2021 at 23:5 Comment(3)

Lot's of nice ways to solve this one. – Ovida 29/10, 2021 at 23:19

In this case your answer is best. And I think Guedes already answered. So this time you can relax and enjoy! – Arrange 29/10, 2021 at 23:21

This is also great my friend:) – Lipocaic 14/11, 2021 at 13:12

Another tidyverse option. Advantage here is that it keeps the original data type and doesn‘t convert everything to character values.

library(tidyverse)
mydata %>%
  pivot_longer(cols = -name,
               names_pattern = '(.*)(\\..)',
               names_to = c('.value', 'number')) %>%
  group_by(name) %>%
  mutate(across(-number, ~if_else(is.na(.[1]), .[2], .[1]))) %>%
  ungroup() %>%
  filter(number == '.1') %>%
  select(-number)

Which gives

# A tibble: 5 x 4
  name   cheese  milk tofu 
  <chr>   <dbl> <dbl> <chr>
1 Sarah       1     1 yum  
2 Ella        4     2 yum  
3 Carmen      3     0 yum  
4 Dinah       6     4 gross
5 Billie     NA     2 gross

Alternative solution with coalesce:

mydata %>%
  pivot_longer(cols = -name,
               names_pattern = '(.*)(\\..)',
               names_to = c('.value', 'number')) %>%
  group_by(name) %>%
  mutate(across(-number, ~coalesce(.[1], .[2]))) %>%
  ungroup() %>%
  filter(number == '.1') %>%
  select(-number)

Townswoman answered 29/10, 2021 at 22:34 Comment(1)

Yeah I thought of something like this first but wanted some dynamic columns in the coalesce function. I guess @Characterize showed a proper solution for that one. :-) – Ovida 29/10, 2021 at 22:45

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags