How to use map from purrr with dplyr::mutate to create multiple new columns based on column pairs
Asked Answered
L

10

23

I have to following issue using R. In short I want to create multiple new columns in a data frame based on calculations of different column pairs in the data frame.

The data looks as follows:

df <- data.frame(a1 = c(1:5), 
                 b1 = c(4:8), 
                 c1 = c(10:14), 
                 a2 = c(9:13), 
                 b2 = c(3:7), 
                 c2 = c(15:19))
df
a1 b1 c1 a2 b2 c2
1  4 10  9  3 15
2  5 11 10  4 16
3  6 12 11  5 17
4  7 13 12  6 18
5  8 14 13  7 19

The output is supposed to look like the following:

a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
1  4 10  9  3 15    10     7    25
2  5 11 10  4 16    12     9    27
4  7 13 12  6 18    16    13    31
5  8 14 13  7 19    18    15    33

I can achieve this using dplyr doing some manual work in the following way:

df %>% rowwise %>% mutate(sum_a = sum(a1, a2),
                          sum_b = sum(b1, b2),
                          sum_c = sum(c1, c2)) %>% 
  as.data.frame()

So what is being done is: take columns with the letter "a" in it, calulate the sum rowwise, and create a new column with the sum named sum_[letter]. Repeat for columns with different letters.

This is working, however, if I have a large data set with say 300 different column pairs the manual input would be significant, since I would have to write 300 mutate calls.

I recently stumbled upon the R package "purrr" and my guess is that this would solve my problem of doing what I want in a more automated way.

In particular, I would think to be able to use purrr:map2 to which I pass two lists of column names.

  • list1 = all columns with the number 1 in it
  • list2 = all columns with the number 2 in it

Then I could calculate the sum of each matching list entry, in the form of:

map2(list1, list2, ~mutate(sum))

However, I am not able to figure out how to best approach this problem using purrr. I am rather new to using purrr, so I would really appreciate any help on this issue.

Leboff answered 13/4, 2018 at 12:8 Comment(3)
Are the column names going to become ... aa1, aa2, ab1, ab2 etc after you have 54 columns?Sarena
I see the answer has been edited to reflect the above query. On the lack of a tidy solution... I think maybe there could be something like the transpose of group_by e.g. slice_by ???Sarena
Thank you all very much. I used a classic tidyverse approach of group_by, gather, spread, and summing up on the way (very similar to what was proposed below by "Lorenzo G" and "G. Grothendieck" in answer #1). I never worked with slice_by but I guess that would work nicely as well. I wanted to use a map approach to make the code even shorter and standardized, and the solution proposed by "akrun" perfectly fits that need. Thank you again!Leboff
R
31

Here is one option with purrr. We get the unique prefix of the names of the dataset ('nm1'), use map (from purrr) to loop through the unique names, select the column that matches the prefix value of 'nm1', add the rows using reduce and the bind the columns (bind_cols) with the original dataset

library(tidyverse)
nm1 <- names(df) %>% 
          substr(1, 1) %>%
          unique 
nm1 %>% 
     map(~ df %>% 
            select(matches(.x)) %>%
            reduce(`+`)) %>%
            set_names(paste0("sum_", nm1)) %>%
     bind_cols(df, .)
#    a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
#1  1  4 10  9  3 15    10     7    25
#2  2  5 11 10  4 16    12     9    27
#3  3  6 12 11  5 17    14    11    29
#4  4  7 13 12  6 18    16    13    31
#5  5  8 14 13  7 19    18    15    33
Rutilant answered 13/4, 2018 at 12:42 Comment(6)
That is the solution I was looking for, thank you! It accomplishes what I would have done using gather, spread, and summing up on the way, but in less lines of code. I would say this is a very good solution to automate what I am intending to do. I knew purrr was powerful in that sense. I definitely need to read up on using purrr in order to incorporate it in my daily workflow.Leboff
names(df) %>% sub("\\d+$", "", .) %>% for many columns as per @docendodiscimus solutionSarena
@StephenHenderson Yes, that is a good way. Here, i thought if the letters are only at the first position, we can use substrRutilant
Can we use this method to do subtraction? And if so, how do we specify what to subtract from what?Saracen
@Saracen try reduce(-) (with backquotes). it should be in the order we used in nm1. If you want a different order, then have to rearrange 'nm1'Rutilant
The substr function extracts just a, b, and c. If I want to subtract a2-a1 should I rearrange the dataframe to be in the order a2 b2 c2 a1 b1 c1?Saracen
I
9
df %>% 
  mutate(sum_a = pmap_dbl(select(., starts_with("a")), sum), 
         sum_b = pmap_dbl(select(., starts_with("b")), sum),
         sum_c = pmap_dbl(select(., starts_with("c")), sum))

  a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
1  1  4 10  9  3 15    10     7    25
2  2  5 11 10  4 16    12     9    27
3  3  6 12 11  5 17    14    11    29
4  4  7 13 12  6 18    16    13    31
5  5  8 14 13  7 19    18    15    33

EDIT:

In the case there are many columns, and you wish to apply it programmatically:

row_sums <- function(x) {
  transmute(df, !! paste0("sum_", quo_name(x)) := pmap_dbl(select(df, starts_with(x)), sum))
}

newdf <- map_dfc(letters[1:3], row_sums)
newdf

  sum_a sum_b sum_c
1    10     7    25
2    12     9    27
3    14    11    29
4    16    13    31
5    18    15    33

And if needed you can tack on the original variables with:

bind_cols(df, dfnew)

  a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
1  1  4 10  9  3 15    10     7    25
2  2  5 11 10  4 16    12     9    27
3  3  6 12 11  5 17    14    11    29
4  4  7 13 12  6 18    16    13    31
5  5  8 14 13  7 19    18    15    33
Ichthyosis answered 30/5, 2018 at 17:9 Comment(0)
G
5

In case you like to consider a base R approach, here's how you could do it:

cbind(df, lapply(split.default(df, substr(names(df), 0,1)), rowSums))
#  a1 b1 c1 a2 b2 c2  a  b  c
#1  1  4 10  9  3 15 10  7 25
#2  2  5 11 10  4 16 12  9 27
#3  3  6 12 11  5 17 14 11 29
#4  4  7 13 12  6 18 16 13 31
#5  5  8 14 13  7 19 18 15 33

It splits the data column-wise into a list, based on the first letter of each column name (either a, b, or c).

If you have a large number of columns and need to differentiate between all characters except the numbers at the end of each column name, you could modify the approach to:

cbind(df, lapply(split.default(df, sub("\\d+$", "", names(df))), rowSums))
Gershwin answered 13/4, 2018 at 12:15 Comment(0)
I
5

Here is another tidyverse approach that uses only the pipe and doesn't require to create new objects.

library(tidyverse)

df %>% 
  bind_cols(
    map_dfc(.x = list("a", "b", "c"), 
            .f = ~ .y %>% 
               rowwise() %>% 
               transmute(!!str_c("sum_", .x) := sum(c_across(starts_with(.x)))),
            .y = .)
  )
#>   a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
#> 1  1  4 10  9  3 15    10     7    25
#> 2  2  5 11 10  4 16    12     9    27
#> 3  3  6 12 11  5 17    14    11    29
#> 4  4  7 13 12  6 18    16    13    31
#> 5  5  8 14 13  7 19    18    15    33

Explanation

The dataframe is piped into bind_cols() which binds the original columns with the newly created columns. The new columns are created with purrr::map_dfc() which takes a list of variable prefixes (.x) and the transforming function (.f). Additionally, the piped data (.) is assigned as another argument (.y). Since rowwise operations are required, rowwise() and c_across() are used in each iteration over the prefixes. transmute is used so that the original variables are not duplicated. In order to dynamically create variable names, the bang-bang operator (!!) along with := are used inside transmute.

Note

It would be shorter to use rowSums() instead of rowwise() and c_across() but other functions can easier be implemented using this approach.

Illtreat answered 12/1, 2022 at 4:29 Comment(1)
Can we use this method to do subtraction? And if so, how do we specify what to subtract from what?Saracen
S
4

in base R, all vectorized:

nms <- names(df)
df[paste0("sum_",unique(gsub("[1-9]","",nms)))] <-
  df[endsWith(nms,"1")] + df[endsWith(nms,"2")]

#   a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
# 1  1  4 10  9  3 15    10     7    25
# 2  2  5 11 10  4 16    12     9    27
# 3  3  6 12 11  5 17    14    11    29
# 4  4  7 13 12  6 18    16    13    31
# 5  5  8 14 13  7 19    18    15    33
Snaggy answered 15/4, 2018 at 23:46 Comment(1)
If you want to make it extendable to an arbitrary function too - Map(`+`, df[endsWith(names(df),"1")], df[endsWith(names(df),"2")])Chindwin
L
2

1) dplyr/tidyr Convert to long form, summarize and convert back to wide form:

library(dplyr)
library(tidyr)

DF %>%
  mutate(Row = 1:n()) %>%
  gather(colname, value, -Row) %>%
  group_by(g = gsub("\\d", "", colname), Row) %>%
  summarize(sum = sum(value)) %>%
  ungroup %>%
  mutate(g = paste("sum", g, sep = "_")) %>%
  spread(g, sum) %>%
  arrange(Row) %>%
  cbind(DF, .) %>%
  select(-Row)

giving:

  a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
1  1  4 10  9  3 15    10     7    25
2  2  5 11 10  4 16    12     9    27
3  4  7 13 12  6 18    16    13    31
4  5  8 14 13  7 19    18    15    33

2) base using matrix multiplication

nms is a vector of column names without the digits and prefaced with sum_. u is a vector of the unique elements of it. Form a logical matrix using outer from that which when multiplied by DF gives the sums -- the logicals get converted to 0-1 when that is done. Finally bind it to the input.

nms <- gsub("(\\D+)\\d", "sum_\\1", names(DF))
u <- unique(nms)
sums <- as.matrix(DF) %*% outer(nms, setNames(u, u), "==")
cbind(DF, sums)

giving:

  a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
1  1  4 10  9  3 15    10     7    25
2  2  5 11 10  4 16    12     9    27
3  4  7 13 12  6 18    16    13    31
4  5  8 14 13  7 19    18    15    33

3) base with tapply

Using nms from (2) apply tapply to each row:

cbind(DF, t(apply(DF, 1, tapply, nms, sum)))

giving:

  a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
1  1  4 10  9  3 15    10     7    25
2  2  5 11 10  4 16    12     9    27
3  4  7 13 12  6 18    16    13    31
4  5  8 14 13  7 19    18    15    33

You may wish to replace nms with factor(nms, levels = unique(nms)) in the above expression if the names are not in ascending order.

Laryngotomy answered 13/4, 2018 at 12:36 Comment(0)
D
2

For a hackish tidy solution, check this out:

library(tidyr)
library(dplyr)

df %>% 
   rownames_to_column(var = 'row') %>% 
   gather(a1:c2, key = 'key', value = 'value') %>% 
   extract(key, into = c('col.base', 'col.index'), regex = '([a-zA-Z]+)([0-9]+)') %>% 
   group_by(row, col.base) %>% 
   summarize(.sum = sum(value)) %>%
   spread(col.base, .sum) %>% 
   bind_cols(df, .) %>% 
   select(-row)

Basically, I collect all pairs of columns with their values across all rows, separate the column name in two parts, calculate the row sums for columns with the same letter, and cast it back to the wide form.

Darlleen answered 13/4, 2018 at 12:45 Comment(1)
Similar to what I would also have done. A nice tidyverse approach. Thanks!Leboff
F
1

Another solution that splits df by the numbers than use Reduce to calculate the sum

library(tidyverse)

df %>% 
  split.default(., substr(names(.), 2, 3)) %>% 
  Reduce('+', .) %>% 
  set_names(paste0("sum_", substr(names(.), 1, 1))) %>% 
  cbind(df, .)

#>   a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
#> 1  1  4 10  9  3 15    10     7    25
#> 2  2  5 11 10  4 16    12     9    27
#> 3  3  6 12 11  5 17    14    11    29
#> 4  4  7 13 12  6 18    16    13    31
#> 5  5  8 14 13  7 19    18    15    33

Created on 2018-04-13 by the reprex package (v0.2.0).

Flotage answered 13/4, 2018 at 12:49 Comment(0)
A
1

A slightly different approach using base R:

cbind(df, lapply(unique(gsub("\\d+","", colnames(df))), function(li) {
   set_names(data.frame(V = apply(df[grep(li, colnames(df), val = T)], FUN = sum, MARGIN = 1)), paste0("sum_", li))
}))
#  a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
#1  1  4 10  9  3 15    10     7    25
#2  2  5 11 10  4 16    12     9    27
#3  3  6 12 11  5 17    14    11    29
#4  4  7 13 12  6 18    16    13    31
#5  5  8 14 13  7 19    18    15    33
Argue answered 29/12, 2018 at 5:53 Comment(0)
E
0

There are two approaches not covered yet in the existing answers:

  1. dplyr::across
  2. the 'dplyover' package using dplyover::across() or dplyover::over()

Disclaimer: I'm the maintainer of 'dplyover' and the package is not on CRAN.

Let's start with a very efficient 'dplyr' only solution using across(): Inside of across we can use cur_column() to get the name of the current column. Based on this we can use gsub() to change the column name from a1 to a2 for example and then use get() to get the values of this column.

We can use the same principle inside the .names argument. Here we access the name of the current column with {.x}. We can use other function expressions such as gsub() by wrapping them in curly braces.

library(dplyr) 

df %>% 
  mutate(across(ends_with("1"),
                ~ .x + get(gsub("(.*)1$",    # <- search for pattern ending with 1 
                                "\\12",      # <- replace with everything \\1 and add 2
                                cur_column() # <- apply this to current column name
                                )
                           ),
                .names = "sum_{gsub('$1', '', {.col} )}"
                )
         )
#>   a1 b1 c1 a2 b2 c2 sum_a1 sum_b1 sum_c1
#> 1  1  4 10  9  3 15     10      7     25
#> 2  2  5 11 10  4 16     12      9     27
#> 3  3  6 12 11  5 17     14     11     29
#> 4  4  7 13 12  6 18     16     13     31
#> 5  5  8 14 13  7 19     18     15     33

The above approach is computational efficient, especially with grouped data, but its quite verbose and hard to make sense of.

The 'dplyover' package come with two alternatives which are made for this job. The first is across2(). We can create new columns iterating over pairs of columns. Here ends_with("1") and ends_with("2"). Finally, across2() makes it easy to name the new columns with its .names argument. We can use {pre} to find the common prefix of the current column pairs.

library(dplyover)

# using across2()
df %>% 
  mutate(across2(ends_with("1"),
                 ends_with("2"),
                 ~ .x + .y,
                 .names = "sum_{pre}")
         )
#>   a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
#> 1  1  4 10  9  3 15    10     7    25
#> 2  2  5 11 10  4 16    12     9    27
#> 3  3  6 12 11  5 17    14    11    29
#> 4  4  7 13 12  6 18    16    13    31
#> 5  5  8 14 13  7 19    18    15    33

One draw back of across2() is, it assumes that the columns to add are in the correct order. If we have many column it might not be easy to check if this is the case. The make the approach more programmtically safe we can use over(). Here we iterate over a string of column name stems. We generate this string with cut_names() which cuts the column names of before a certain pattern, here a digit "\d" this yields a vector c("a", "b", "c"). We can the construct the variable names as strings wrapped inside .() where {.x} evaluates to the current string. Again .names makes it easy to name the output columns.

# using over()

df %>% 
  mutate(over(cut_names("\\d"), # <- this yields c("a", "b", "c")
              ~ .("{.x}1") + .("{.x}2"),
              .names = "sum_{x}")
         )
#>   a1 b1 c1 a2 b2 c2 sum_a sum_b sum_c
#> 1  1  4 10  9  3 15    10     7    25
#> 2  2  5 11 10  4 16    12     9    27
#> 3  3  6 12 11  5 17    14    11    29
#> 4  4  7 13 12  6 18    16    13    31
#> 5  5  8 14 13  7 19    18    15    33

Data from OP:

df <- data.frame(a1 = c(1:5), 
                 b1 = c(4:8), 
                 c1 = c(10:14), 
                 a2 = c(9:13), 
                 b2 = c(3:7), 
                 c2 = c(15:19))

Created on 2023-07-15 with reprex v2.0.2

Echols answered 15/7, 2023 at 10:42 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.