Efficiently sum across multiple columns in R

Asked 12/3, 2015 at 9:32 Answered 28/7, 2023 at 19:55

I have the following condensed data set:

a<-as.data.frame(c(2000:2005))
a$Col1<-c(1:6)
a$Col2<-seq(2,12,2)

colnames(a)<-c("year","Col1","Col2")

for (i in 1:2){
  a[[paste("Var_", i, sep="")]]<-i*a[[paste("Col", i, sep="")]]
}

I would like to sum the columns Var1 and Var2, which I use:

a$sum<-a$Var_1 + a$Var_2

In reality my data set is much larger - I would like to sum from Var_1 to Var_n (n can be upto 20). There must be a more efficient way to do this than:

 a$sum<-a$Var_1 + ... + a$Var_n

Sheilasheilah answered 12/3, 2015 at 9:32 Comment(4)

Try with apply: a$sum <- apply(a[,-1], 1, sum) – Inflame 12/3, 2015 at 9:35

Thanks, works well in the following where i is the column index of Var_1 and j is the column index of Var_n a$sum <- apply(a[,c(i:j)], 1, sum) – Sheilasheilah 12/3, 2015 at 9:40

And automating the process even further (using #9277863) : a$sum <- apply(a[,c(match("Var_1",names(a)):match("Var_n",names(a)))], 1, sum) – Sheilasheilah 12/3, 2015 at 9:44

a$Col3 <- rowSums(a[,2:3]) – Wilbourn 12/3, 2015 at 9:48

Here's a solution using the tidyverse. You can extend it to as many columns as you like using the select() function to select the appropriate columns within a mutate().

library(tidyverse)

a<-as.data.frame(c(2000:2005))
a$Col1<-c(1:6)
a$Col2<-seq(2,12,2)

colnames(a)<-c("year","Col1","Col2")

for (i in 1:2){
    a[[paste("Var_", i, sep="")]]<-i*a[[paste("Col", i, sep="")]]
}
a
#>   year Col1 Col2 Var_1 Var_2
#> 1 2000    1    2     1     4
#> 2 2001    2    4     2     8
#> 3 2002    3    6     3    12
#> 4 2003    4    8     4    16
#> 5 2004    5   10     5    20
#> 6 2005    6   12     6    24

# Tidyverse solution
a %>%
    mutate(Total = select(., Var_1:Var_2) %>% rowSums(na.rm = TRUE))
#>   year Col1 Col2 Var_1 Var_2 Total
#> 1 2000    1    2     1     4     5
#> 2 2001    2    4     2     8    10
#> 3 2002    3    6     3    12    15
#> 4 2003    4    8     4    16    20
#> 5 2004    5   10     5    20    25
#> 6 2005    6   12     6    24    30

^{Created on 2019-01-01 by the reprex package (v0.2.1)}

Revolt answered 2/1, 2019 at 2:27 Comment(4)

How to calculate percent of each columns (Col1/total, caol2/total)? – Silvers 18/9, 2021 at 13:59

@Silvers - I'd add a mutate_at(), which mutates Col1:Col2, with a function that divides by the total column. – Revolt 22/9, 2021 at 17:27

@Matt Dancho, how would you modify it if they columns you wanted to sum across were not adjacent to each other? – Balikpapan 6/12, 2022 at 18:9

I just tried this on my data, and mutate(Total= apply(a %>% select(starts_with("Col")), 1, sum)) is orders of magnitude slower than simply Col1 + Col2 + ... + Coln, but your approach mutate(Total= select(., starts_with("Col")) %>% rowSums(na.rm = TRUE)) is on par (i.e. fast). I thought these would be nearly identical. Very surprised! – Myke 10/4, 2024 at 1:11

You can use colSums(a[,c("Var1", "Var2")]) or rowSums(a[,c("Var_1", "Var_2")]). In your case you want the latter.

Unbalanced answered 12/3, 2015 at 9:37 Comment(0)

with dplyr you can use

a %>%
rowwise() %>%
mutate(sum = sum(Col1,Col1, na.rm = T))

or more efficiently

a %>%
rowwise() %>%
mutate(sum = sum(across(starts_with("Col")), na.rm = T))

Contessacontest answered 18/5, 2021 at 15:57 Comment(1)

For large datasets, especially as number of groups approaches number of rows, rowwise is very slow. Also see my comment to Matt Dancho's answer, which compares sum to rowSums, the latter being much faster. – Myke 10/4, 2024 at 1:17

Benchmarking seems to show that plain Reduce('+', ...) is the fastest. Libraries just make it (at least slightly) slower, at least for mtcars, even if I expand it to be huge.

Unit: milliseconds
         expr        min         lq       mean     median         uq        max
      rowSums   8.672061   9.014344  13.708022   9.602312  10.672726  148.47183
       Reduce   2.994240   3.157500   6.331503   3.223612   3.616555   99.49181
        apply 524.488376 651.549401 771.095002 743.286441 857.993418 1235.53153
        Rfast   5.649006   5.901787  11.110896   6.387990   9.727408   66.03151
   DT_rowSums   9.209539   9.566574  20.955033  10.131163  12.967030  294.32911
    DT_Reduce   3.590719   3.774761  10.595256   3.924592   4.259343  340.52855
 tidy_rowSums  15.532917  15.997649  33.736883  17.316108  27.072343  343.21254
  tidy_Reduce   8.627810   8.960008  12.271105   9.603124  11.089334   79.98853

Code:

library('data.table')
library('tidyverse')
library('Rfast')
DFcars = data.table::copy(mtcars)
DFcars = do.call("rbind", replicate(10000, DFcars, simplify = FALSE))
DT_cars = data.table::copy(DFcars)
DFcars2 = data.table::copy(DFcars)
setDT(DT_cars)
colnms = c("mpg", "cyl", "disp", "hp", "drat")

microbenchmark::microbenchmark(
    rowSums =
        {
            DFcars$new_col = rowSums(DFcars[, colnms])
            (as.numeric(DFcars$new_col))
        },
    Reduce =
        {
            DFcars$new_col = Reduce('+', DFcars[, colnms])
            (as.numeric(DFcars$new_col))
        },
    apply =
        {
            DFcars$new_col = apply(DFcars[, 1:5], 1, sum)
            (as.numeric(DFcars$new_col))
        },
    Rfast =
        {
            DFcars$new_col = rowsums(as.matrix(DFcars[, colnms]))
            (as.numeric(DFcars$new_col))
        },
    DT_rowSums =
        {
            DT_cars[, new_col := rowSums(.SD), .SDcols = colnms]
            (as.numeric(DT_cars$new_col))
        },
    DT_Reduce =
        {
            DT_cars[, new_col := Reduce('+', .SD), .SDcols = colnms]
            (as.numeric(DT_cars$new_col))
        },
    tidy_rowSums =
        {
            DFcars2 = DFcars2 %>% mutate(new_col = select(., colnms) %>% rowSums())
            (as.numeric(DFcars2$new_col))
        },
    tidy_Reduce =
        {
            DFcars2 = DFcars2 %>% mutate(new_col = select(., colnms) %>% Reduce('+', .))
            (as.numeric(DFcars2$new_col))
        },
    check = 'equivalent'
)

Glasswort answered 3/3, 2022 at 5:27 Comment(1)

Thanks for the much needed benchmarking. Interesting how much slower the tidy versions are. – Myke 10/4, 2024 at 1:40

If you're working with a very large dataset, rowSums can be slow.

An alternative is the rowsums function from the Rfast package. This requires you to convert your data to a matrix in the process and use column indices rather than names. Here's an example based on your code:

## load Rfast
library(Rfast)

## create dataset
a <- as.data.frame(c(2000:2005))
a$Col1 <- c(1:6)
a$Col2 <- seq(2,12,2)

colnames(a) <- c("year","Col1","Col2")

for (i in 1:2){
  a[[paste("Var_", i, sep="")]] <- i*a[[paste("Col", i, sep="")]]
}

## get column indices based on names
col_st <- grep("Var_1", colnames(a))  # index of "Var_1" col
col_en <- grep("Var_2", colnames(a))  # index of "Var_2" col
cols   <- c(col_st:col_en)  # indices of all cols from "Var_1" to "Var_2"

## sum rows 4 to 5
a$Total <- rowsums(as.matrix(a[,cols]))

Chane answered 2/7, 2020 at 14:27 Comment(0)

Still another solution , with the janitor package:

janitor::adorn_totals(a, where = "col", ... = Var_1:Var_2)

or equivalently, using the compact tidyselect syntax,

janitor::adorn_totals(a, where = "col", ... = starts_with("Var"))

Remark that just janitor::adorn_totals(a, where = "col") would add up all numeric columns, except for the first one.

Sanborne answered 23/3, 2023 at 7:56 Comment(0)

You can use this:

library(dplyr)
a$Sum <- apply(a[,select(a, starts_with("Var_"))], 1, sum)

Hedonics answered 29/8, 2021 at 17:36 Comment(0)

In Base R:

You could simply just use sapply:

sapply(unique(sub(".$", "", colnames(a))), function(x) rowSums(a[startsWith(colnames(a), x)]))

This is very reliable, it works for anything.

Insouciant answered 13/10, 2021 at 9:32 Comment(0)

pick() is introduced in dplyr v1.1.0 to select the columns in mutate() and summarise():

library(dplyr)
a %>%
  mutate(Total = rowSums(pick(Var_1:Var_2), na.rm = TRUE))

Here are more applications of pick()

Lauretta answered 28/7, 2023 at 19:55 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags