data.table row-wise sum, mean, min, max like dplyr?

There are other posts about row-wise operations on data.table, but they are either too simple or solve one specific scenario.

My question here is more generic. There is a solution using dplyr. I have played around but failed to find an equivalent solution using data.table syntax. Can you please suggest an elegant data.table solution that reproduces the same results as the dplyr version?

EDIT 1: Summary of benchmarks of the suggested solutions on a real dataset (10 MB, 73,000 rows, statistics computed on 24 numeric columns). The relative speeds are specific to my data and machine, but the elapsed times were consistently reproducible.

| Solution By | Speed compared to dplyr     |
|-------------|-----------------------------|
| Metrics v1  |  4.3 times SLOWER (use .SD) |
| Metrics v2  |  5.6 times FASTER           |
| ExperimenteR| 15   times FASTER           |
| Arun v1     |  3   times FASTER (Map func)|
| Arun v2     |  3   times FASTER (foo func)|
| Ista        |  4.5 times FASTER           |

EDIT 2: I added the NAcnt column a day later, which is why it does not appear in the solutions suggested by the various contributors.

Data Setup

library(data.table)
dt <- data.table(ProductName = c("Lettuce", "Beetroot", "Spinach", "Kale", "Carrot"),
    Country = c("CA", "FR", "FR", "CA", "CA"),
    Q1 = c(NA, 61, 40, 54, NA), Q2 = c(22,  8, NA,  5, NA),
    Q3 = c(51, NA, NA, 16, NA), Q4 = c(79, 10, 49, NA, NA))

#    ProductName Country Q1 Q2 Q3 Q4
# 1:     Lettuce      CA NA 22 51 79
# 2:    Beetroot      FR 61  8 NA 10
# 3:     Spinach      FR 40 NA NA 49
# 4:        Kale      CA 54  5 16 NA
# 5:      Carrot      CA NA NA NA NA

SOLUTION using dplyr + rowwise()

library(dplyr) ; library(magrittr)
dt %>% rowwise() %>% 
    transmute(ProductName, Country, Q1, Q2, Q3, Q4,
     AVG = mean(c(Q1, Q2, Q3, Q4), na.rm=TRUE),
     MIN = min (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
     MAX = max (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
     SUM = sum (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
     NAcnt= sum(is.na(c(Q1, Q2, Q3, Q4))))

#   ProductName Country Q1 Q2 Q3 Q4      AVG MIN  MAX SUM NAcnt
# 1     Lettuce      CA NA 22 51 79 50.66667  22   79 152     1
# 2    Beetroot      FR 61  8 NA 10 26.33333   8   61  79     1
# 3     Spinach      FR 40 NA NA 49 44.50000  40   49  89     2
# 4        Kale      CA 54  5 16 NA 25.00000   5   54  75     1
# 5      Carrot      CA NA NA NA NA      NaN Inf -Inf   0     4

ERROR with data.table (compute entire column instead of per-row)

dt[, .(ProductName, Country, Q1, Q2, Q3, Q4,
    AVG = mean(c(Q1, Q2, Q3, Q4), na.rm=TRUE),
    MIN = min (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
    MAX = max (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
    SUM = sum (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
    NAcnt= sum(is.na(c(Q1, Q2, Q3, Q4))))]

#    ProductName Country Q1 Q2 Q3 Q4      AVG MIN MAX SUM NAcnt
# 1:     Lettuce      CA NA 22 51 79 35.90909   5  79 395     9
# 2:    Beetroot      FR 61  8 NA 10 35.90909   5  79 395     9
# 3:     Spinach      FR 40 NA NA 49 35.90909   5  79 395     9
# 4:        Kale      CA 54  5 16 NA 35.90909   5  79 395     9
# 5:      Carrot      CA NA NA NA NA 35.90909   5  79 395     9
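
The identical values in every row arise because, inside `j`, each of `Q1`..`Q4` is the full column vector, so `c(Q1, Q2, Q3, Q4)` pools all 20 values before `mean()` sees them. A minimal sketch of that effect, followed by an element-wise alternative for MIN and MAX using base R's `pmin()`/`pmax()`. One assumption to keep in mind: with `na.rm = TRUE` these return `NA` for a row that is entirely `NA`, not the `Inf`/`-Inf` that `min()`/`max()` produce.

```r
library(data.table)

dt <- data.table(ProductName = c("Lettuce", "Beetroot", "Spinach", "Kale", "Carrot"),
                 Country = c("CA", "FR", "FR", "CA", "CA"),
                 Q1 = c(NA, 61, 40, 54, NA), Q2 = c(22,  8, NA,  5, NA),
                 Q3 = c(51, NA, NA, 16, NA), Q4 = c(79, 10, 49, NA, NA))

# Inside j, each Q column is the full 5-element vector, so c() pools all 20 values.
pooled_n   <- dt[, length(c(Q1, Q2, Q3, Q4))]
pooled_avg <- dt[, mean(c(Q1, Q2, Q3, Q4), na.rm = TRUE)]

# Element-wise helpers return one value per row instead.
row_stats <- dt[, .(ProductName,
                    MIN = pmin(Q1, Q2, Q3, Q4, na.rm = TRUE),
                    MAX = pmax(Q1, Q2, Q3, Q4, na.rm = TRUE))]
print(row_stats)
```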

ALMOST solution but more complex and missing Q1,Q2,Q3,Q4 output columns

# data.table's own melt(), so that dtmelt stays a data.table
dtmelt <- melt(dt, id.vars=c("ProductName", "Country"),
            variable.name="Quarter", value.name="Qty")

dtmelt[, .(AVG = mean(Qty, na.rm=TRUE),
    MIN = min (Qty, na.rm=TRUE),
    MAX = max (Qty, na.rm=TRUE),
    SUM = sum (Qty, na.rm=TRUE),
    NAcnt= sum(is.na(Qty))), by = list(ProductName, Country)]

#    ProductName Country      AVG MIN  MAX SUM NAcnt
# 1:     Lettuce      CA 50.66667  22   79 152     1
# 2:    Beetroot      FR 26.33333   8   61  79     1
# 3:     Spinach      FR 44.50000  40   49  89     2
# 4:        Kale      CA 25.00000   5   54  75     1
# 5:      Carrot      CA      NaN Inf -Inf   0     4
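
One possible way to recover the Q1..Q4 columns from the long-format result is to join the per-product statistics back onto the original table. This is only a sketch: it assumes that `ProductName` and `Country` together identify a row uniquely, and the names `long`, `stats` and `wide` are illustrative.

```r
library(data.table)

dt <- data.table(ProductName = c("Lettuce", "Beetroot", "Spinach", "Kale", "Carrot"),
                 Country = c("CA", "FR", "FR", "CA", "CA"),
                 Q1 = c(NA, 61, 40, 54, NA), Q2 = c(22,  8, NA,  5, NA),
                 Q3 = c(51, NA, NA, 16, NA), Q4 = c(79, 10, 49, NA, NA))

# Long format: one row per product and quarter.
long <- melt(dt, id.vars = c("ProductName", "Country"),
             variable.name = "Quarter", value.name = "Qty")

# min()/max() warn on the all-NA Carrot rows; the warnings are silenced here.
stats <- suppressWarnings(
  long[, .(AVG   = mean(Qty, na.rm = TRUE),
           MIN   = min(Qty, na.rm = TRUE),
           MAX   = max(Qty, na.rm = TRUE),
           SUM   = sum(Qty, na.rm = TRUE),
           NAcnt = sum(is.na(Qty))),
       by = .(ProductName, Country)]
)

# Join the statistics onto the original rows, then put the original columns first.
wide <- stats[dt, on = c("ProductName", "Country")]
setcolorder(wide, c(names(dt), setdiff(names(wide), names(dt))))
print(wide)
```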
Hyperaesthesia answered 7/7, 2015 at 1:57 Comment(5)
dt[, AVG := rowMeans(.SD, na.rm=T),.SDcols=c(Q1, Q2,Q3,Q4)] – Sisyphean
@Sisyphean thanks (should SDcols be a char vector?) I tried this dt[, .(Q1, Q2, Q3, Q4, AVG = rowMeans(.SD, na.rm=T), MIN = pmin(Q1,Q2,Q3,Q4, na.rm=T), MAX = pmax(Q1,Q2,Q3,Q4, na.rm=T) ), .SDcols=c("Q1","Q2","Q3","Q4")] but it still misses SUM and doesn't have the ProductName, Country columns – Hyperaesthesia
@Metrics there is no output b/c of evaluation error: dt[, `:=` (AVG = rowMeans(.SD, na.rm=TRUE), MIN = min(.SD, na.rm=TRUE), MAX = max(.SD, na.rm=TRUE), SUM = sum(.SD, na.rm=TRUE)), .SDcols = c("Q1","Q2","Q3","Q4"), by=1:nrow(dt)] Warning messages: 1: In min(c(NA_real_, NA_real_, NA_real_, NA_real_), na.rm = TRUE) : no non-missing arguments to min; returning Inf 2: In max(c(NA_real_, NA_real_, NA_real_, NA_real_), na.rm = TRUE) : no non-missing arguments to max; returning -Inf – Hyperaesthesia
See my answer. I have updated the code and removed it from the comments. dplyr and data.table both issue warnings for NaN and -Inf. – Chirrup
data.table uses base R functions wherever possible so as to not impose a "walled garden" approach. However base R doesn't have a nice function that does this operation :-(. So we'll have to implement colwise() and rowwise() functions as filed under #1063... I've marked it for next release. – Billetdoux

You can use the efficient row-wise functions from the matrixStats package.

library(matrixStats)
# .SDcols must be a character vector of column names
dt[, `:=`(MIN = rowMins(as.matrix(.SD), na.rm=TRUE),
          MAX = rowMaxs(as.matrix(.SD), na.rm=TRUE),
          AVG = rowMeans(.SD, na.rm=TRUE),
          SUM = rowSums(.SD, na.rm=TRUE)), .SDcols=c("Q1", "Q2", "Q3", "Q4")]

dt
#    ProductName Country Q1 Q2 Q3 Q4 MIN  MAX      AVG SUM
# 1:     Lettuce      CA NA 22 51 79  22   79 50.66667 152
# 2:    Beetroot      FR 61  8 NA 10   8   61 26.33333  79
# 3:     Spinach      FR 40 NA NA 49  40   49 44.50000  89
# 4:        Kale      CA 54  5 16 NA   5   54 25.00000  75
# 5:      Carrot      CA NA NA NA NA Inf -Inf      NaN   0
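
A variation on the same idea, offered as a sketch and not as a benchmarked improvement: convert the quarter columns to a matrix once and reuse it, which also makes it easy to add the `NAcnt` column from the question with base R's `rowSums(is.na(...))`. The names `qcols` and `m` are illustrative.

```r
library(data.table)
library(matrixStats)

dt <- data.table(ProductName = c("Lettuce", "Beetroot", "Spinach", "Kale", "Carrot"),
                 Country = c("CA", "FR", "FR", "CA", "CA"),
                 Q1 = c(NA, 61, 40, 54, NA), Q2 = c(22,  8, NA,  5, NA),
                 Q3 = c(51, NA, NA, 16, NA), Q4 = c(79, 10, 49, NA, NA))

qcols <- c("Q1", "Q2", "Q3", "Q4")
m <- as.matrix(dt[, qcols, with = FALSE])   # numeric matrix, converted once

dt[, `:=`(MIN   = rowMins(m, na.rm = TRUE),
          MAX   = rowMaxs(m, na.rm = TRUE),
          AVG   = rowMeans(m, na.rm = TRUE),
          SUM   = rowSums(m, na.rm = TRUE),
          NAcnt = rowSums(is.na(m)))]       # NA count per row
print(dt)
```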

For a dataset with 500,000 rows (using data.table from CRAN):

dt <- rbindlist(lapply(1:100000, function(i)dt))
system.time(dt[, `:=`(MIN = rowMins(as.matrix(.SD), na.rm=T),
                      MAX = rowMaxs(as.matrix(.SD), na.rm=T),
                      AVG = rowMeans(.SD, na.rm=T),
                      SUM = rowSums(.SD, na.rm=T)), .SDcols=c("Q1", "Q2","Q3","Q4")])
#  user  system elapsed 
# 0.089   0.004   0.093

rowwise (or by=.I) is a "euphemism" for a for loop, as the following timings show:

library(dplyr) ; library(magrittr)
system.time(dt %>% rowwise() %>% 
  transmute(ProductName, Country, Q1, Q2, Q3, Q4,
            MIN = min (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
            MAX = max (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
            AVG = mean(c(Q1, Q2, Q3, Q4), na.rm=TRUE),
            SUM = sum (c(Q1, Q2, Q3, Q4), na.rm=TRUE)))
#   user  system elapsed 
# 80.832   0.111  80.974 

system.time(dt[, `:=`(AVG = mean(as.numeric(.SD), na.rm=TRUE),
                      MIN = min(.SD, na.rm=TRUE),
                      MAX = max(.SD, na.rm=TRUE),
                      SUM = sum(.SD, na.rm=TRUE)),
               .SDcols=c("Q1", "Q2","Q3","Q4"), by=.I])
#    user  system elapsed 
# 141.492   0.196 141.757
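
For readers who want to check the vectorised versus per-row gap on their own machine, here is a scaled-down sketch. The table size (10,000 rows) is an arbitrary assumption, the all-NA row is dropped only to avoid warnings, and `by = seq_len(nrow(big))` stands in for explicit per-row grouping. No timings are quoted because they depend on hardware and package versions.

```r
library(data.table)

dt <- data.table(ProductName = c("Lettuce", "Beetroot", "Spinach", "Kale", "Carrot"),
                 Country = c("CA", "FR", "FR", "CA", "CA"),
                 Q1 = c(NA, 61, 40, 54, NA), Q2 = c(22,  8, NA,  5, NA),
                 Q3 = c(51, NA, NA, 16, NA), Q4 = c(79, 10, 49, NA, NA))

qcols <- c("Q1", "Q2", "Q3", "Q4")
big <- rbindlist(rep(list(dt[1:4]), 2500))   # 10,000 rows

# Vectorised: one call over the whole table.
t_vec <- system.time(
  v <- big[, .(SUM = rowSums(.SD, na.rm = TRUE)), .SDcols = qcols]
)

# Per-row grouping: j is evaluated once for every row.
t_row <- system.time(
  r <- big[, .(SUM = sum(unlist(.SD), na.rm = TRUE)),
           by = seq_len(nrow(big)), .SDcols = qcols]
)

stopifnot(isTRUE(all.equal(v$SUM, r$SUM)))   # both give the same sums
print(c(vectorised = t_vec[["elapsed"]], per_row = t_row[["elapsed"]]))
```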
Sisyphean answered 7/7, 2015 at 2:32 Comment(3)
Your solution is the fastest! (see benchmarks in original question) Thanks for introducing the matrixStats package. I would like to know the impact on memory resources of your solution compared to that of Arun's and Metrics' 2nd solution. – Hyperaesthesia
@Sisyphean how's this able to work? dt <- rbindlist(lapply(1:100000, function(i)dt)) . I tried decomposing it but it returned an error: dt(list(1)) . Elegant solution though. – Hinch
Oh, I see! You duplicated the original data.table multiple times and combined all the rows. – Hinch

With by=.I, data.table performs the operation row by row:

library(data.table)
dt[, `:=`(AVG = mean(as.numeric(.SD), na.rm=TRUE),
          MIN = min(.SD, na.rm=TRUE),
          MAX = max(.SD, na.rm=TRUE),
          SUM = sum(.SD, na.rm=TRUE)),
   .SDcols=c("Q1", "Q2", "Q3", "Q4"), by=.I]
   ProductName Country Q1 Q2 Q3 Q4      AVG MIN  MAX SUM
1:     Lettuce      CA NA 22 51 79 50.66667  22   79 152
2:    Beetroot      FR 61  8 NA 10 26.33333   8   61  79
3:     Spinach      FR 40 NA NA 49 44.50000  40   49  89
4:        Kale      CA 54  5 16 NA 25.00000   5   54  75
5:      Carrot      CA NA NA NA NA      NaN Inf -Inf   0

Warning messages:
1: In min(c(NA_real_, NA_real_, NA_real_, NA_real_), na.rm = TRUE) :
  no non-missing arguments to min; returning Inf
2: In max(c(NA_real_, NA_real_, NA_real_, NA_real_), na.rm = TRUE) :
  no non-missing arguments to max; returning -Inf

You get the warning messages because row 5 contains only NAs, so you are computing the mean, sum, min, and max of nothing. For example:

min(c(NA,NA,NA,NA),na.rm=TRUE)
[1] Inf
Warning message:
In min(c(NA, NA, NA, NA), na.rm = TRUE) :
  no non-missing arguments to min; returning Inf
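
If `Inf`/`-Inf` and the accompanying warnings are unwanted, one option is a small wrapper that returns `NA` when nothing is left after dropping the NAs. `safe_min` and `safe_max` below are hypothetical helper names, not part of base R or data.table, and whether `NA` is the right answer for an empty row is a design choice.

```r
# Hypothetical helpers: NA (and no warning) when every value is missing.
safe_min <- function(x) {
  x <- x[!is.na(x)]
  if (length(x)) min(x) else NA_real_
}
safe_max <- function(x) {
  x <- x[!is.na(x)]
  if (length(x)) max(x) else NA_real_
}

safe_min(c(NA, NA, NA, NA))   # NA, no warning
safe_min(c(NA, 22, 51, 79))   # 22
safe_max(c(NA, 22, 51, 79))   # 79
```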
Chirrup answered 7/7, 2015 at 2:48 Comment(8)
Same error, could that be b/c I am using latest data.table 1.9.4 (R version 3.2.0 (2015-04-16))? In addition, I must put SDcols in quotes .SDcols=c("Q1","Q2","Q3","Q4") to avoid "object 'Q1' not found". Here is the error when I run your code: 1: In min(c(NA_real_, NA_real_, NA_real_, NA_real_), na.rm = TRUE) : no non-missing arguments to min; returning Inf 2: In max(c(NA_real_, NA_real_, NA_real_, NA_real_), na.rm = TRUE) : no non-missing arguments to max; returning -Inf – Hyperaesthesia
Those are warnings, not errors (I got them too). You got warnings because your output returns infinite values -Inf, Inf, and NaN (because you are taking the average, sum, min, and max of nothing). If you run your own dplyr code, it issues the same warnings. I am using development version 1.9.5+ (you can get it from GitHub). I am not sure why you need to put quotes. It runs without quotes for me. See my updates in the answer. – Chirrup
Oh that's true. I forgot to print(dt). Sorry! BTW, do you know why I got object 'Q1' not found if I don't put quotes around column names in .SDcols=c(Q1,Q2,Q3,Q4) (data.table 1.9.4, R v3.2.0)? – Hyperaesthesia
Just applied your solution on a 10MB dataset, 73000 rows. The dplyr version is 4 times faster than the implementation you suggested. Could that be the as.numeric(.SD) in the calculation of AVG? – Hyperaesthesia
You can't benchmark on such a small data set; it is pretty meaningless. – Encephalo
Yes @David. You are correct. It doesn't make sense. I have omitted it now. – Chirrup
@Polymerase: I think it has to do with .SD. Try this, where you have to enter all column names: dt[,`:=`(AVF = mean (c(Q1, Q2, Q3, Q4), na.rm=TRUE),MIN = min (c(Q1, Q2, Q3, Q4), na.rm=TRUE),MAX = max (c(Q1, Q2, Q3, Q4), na.rm=TRUE),AVG = mean(c(Q1, Q2, Q3, Q4), na.rm=TRUE),SUM = sum (c(Q1, Q2, Q3, Q4), na.rm=TRUE)),by=1:nrow(dt)]. This is faster than your dplyr for your small sample data. – Chirrup
@Chirrup the 2nd version you suggested is very fast. Let me test all solutions here and I'll make a summary of all my tests. – Hyperaesthesia

Just another way (not that efficient though, as na.omit() is called each time, and many memory allocations as well):

require(data.table)
new_cols = c("MIN", "MAX", "SUM", "AVG")
dt[, (new_cols) := Map(function(x, f) f(x), 
                       list(na.omit(c(Q1,Q2,Q3,Q4))), 
                       list(min, max, sum, mean)),
   by = .I]

#    ProductName Country Q1 Q2 Q3 Q4 MIN  MAX SUM      AVG
# 1:     Lettuce      CA NA 22 51 79  22   79 152 50.66667
# 2:    Beetroot      FR 61  8 NA 10   8   61  79 26.33333
# 3:     Spinach      FR 40 NA NA 49  40   49  89 44.50000
# 4:        Kale      CA 54  5 16 NA   5   54  75 25.00000
# 5:      Carrot      CA NA NA NA NA Inf -Inf   0      NaN

But as I mentioned, this'll get much simpler once colwise() and rowwise() are implemented. The syntax in this case could look something like:

dt[, rowwise(.SD, list(MIN=min, MAX=max, SUM=sum, AVG=mean), na.rm=TRUE), by = .I]
# `by = ` is really not necessary in this case.

or even more straightforward for this case:

rowwise(dt, list(...), na.rm=TRUE)

Edit:

Another variation:

myNACount <- function(x, ...) length(attributes(x)$na.action)
foo <- function(x, ...) {
    # the order of funs must match the order of new_cols below
    funs = c(min, max, sum, mean, myNACount)
    lapply(funs, function(f) f(x, ...))
}

new_cols = c("MIN", "MAX", "SUM", "AVG", "NAs")
dt[, (new_cols) := foo(na.omit(c(Q1, Q2, Q3, Q4)), na.rm=TRUE), by=.I]
#    ProductName Country Q1 Q2 Q3 Q4 MIN  MAX SUM      AVG NAs
# 1:     Lettuce      CA NA 22 51 79  22   79 152 50.66667   1
# 2:    Beetroot      FR 61  8 NA 10   8   61  79 26.33333   1
# 3:     Spinach      FR 40 NA NA 49  40   49  89 44.50000   2
# 4:        Kale      CA 54  5 16 NA   5   54  75 25.00000   1
# 5:      Carrot      CA NA NA NA NA Inf -Inf   0      NaN   4
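
The NA count works because `na.omit()` records the positions of the values it removed in an `na.action` attribute on its result; when nothing is removed, the attribute is absent and its length is 0. A small base R illustration (the variable names are illustrative):

```r
x <- c(NA, 22, 51, 79)
y <- na.omit(x)

# Positions of the removed values (here position 1), stored by na.omit().
removed <- attributes(y)$na.action
n_removed <- length(removed)

# No NAs: na.omit() returns the input unchanged, so the attribute is NULL.
n_none <- length(attributes(na.omit(c(1, 2, 3)))$na.action)

print(c(n_removed = n_removed, n_none = n_none))
```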
Billetdoux answered 7/7, 2015 at 12:57 Comment(7)
Yes, why did you add the by in the rowwise potential solution? – Encephalo
There might be complex scenarios like dt[, if (TRUE) do_bla else rowwise(...), by=some_cols] (like I said, in this case, it isn't necessary). – Billetdoux
Hi Arun, this is wonderful, the solution you suggested is 4 times FASTER than the dplyr version (tested on my real 10MB dataset). BTW, I have edited the original question (added the NAcount calculation). I have modified your example by adding a MyNACount function, but got NAcnt=0 b/c na.omit() had removed all NAs. Can you please suggest a solution? MyNACount <- function(vectNum) { sum(is.na(vectNum)) } new_cols = c("AVG", "MIN", "MAX", "SUM", "NAcnt") dt[, (new_cols) := Map(function(x, f) f(x), list(na.omit(c(Q1,Q2,Q3,Q4))), list(mean, min, max, sum, MyNACount)), by = 1:nrow(dt)] – Hyperaesthesia
@Polymerase, you can define myNACount as follows: myNACount <- function(x) length(attributes(x)$na.action). – Billetdoux
@Billetdoux That myNACount <- function(x) length(attributes(x)$na.action) function is outstanding. Thanks. I wish I could understand the mechanism of the optimization. The 2nd variation you suggested is blazingly fast. – Hyperaesthesia
@Billetdoux Ahem ... sorry I made a mistake in the benchmark measure. The 2nd variation you made is slightly faster than the 1st version. The fastest exec time is from ExperimenteR's solution. – Hyperaesthesia
@Polymerase, no worries. I think we all learned quite a bit here :-). Great Q. – Billetdoux

The apply function can be used to perform row-wise calculations. Defining the function separately keeps things cleaner:

dstats <- function(x){
    c(mean(x,na.rm=TRUE),
      min(x, na.rm=TRUE),
      max(x, na.rm=TRUE),
      sum(x, na.rm=TRUE))
}

The function can now be applied over the rows of the data.table.

(dt[,
   c("AVG", "MIN", "MAX", "SUM") := data.frame(t(apply(.SD, 1, dstats))),
   .SDcols=c("Q1", "Q2","Q3","Q4"),
])

Notice that the only advantage of doing this with [.data.table is that it allows the use of := for fast adding by reference.

This is slower but more flexible than the matrixStats solution, and faster than the dplyr approach timed in @ExperimenteR's answer, clocking in at 36 seconds (my timings for the other methods were similar to those in @ExperimenteR's answer).

Subrogate answered 7/7, 2015 at 14:47 Comment(4)
1. apply() converts .SD to a matrix = mem alloc. 2. t() transposes result = another copy. 3. data.frame() = another memory alloc. Not sure of the need for with = FALSE here. We can certainly do better by avoiding all these copies. – Billetdoux
@Billetdoux Perhaps, but it is fairly quick already, and we can use matrixStats if we need more speed. I have with = FALSE because help(":=") implies that this is needed when the RHS returns a list. – Subrogate
Fairly quick isn't good enough, really, especially when it's trivial to be much more efficient. I've replied to your reply on github project page detailing the reasons. On with=FALSE, that's not what it means, but I understand the confusion. Will fix. – Billetdoux
@Subrogate your solution is the 2nd fastest, see benchmark results in original question. – Hyperaesthesia

I hope others who encounter the same problem find this helpful.

1st Approach: Combining base R

dt[,`:=`(MIN = apply(dt[, Q1:Q4], 1, FUN = min, na.rm=TRUE),
       MAX = apply(dt[, Q1:Q4], 1, FUN = max, na.rm = TRUE),
       AVG = rowMeans(dt[, Q1:Q4], na.rm = TRUE),
       SUM = rowSums(dt[, Q1:Q4], na.rm = TRUE))][]
# ProductName Country Q1 Q2 Q3 Q4 MIN  MAX      AVG SUM
# 1:     Lettuce      CA NA 22 51 79  22   79 50.66667 152
# 2:    Beetroot      FR 61  8 NA 10   8   61 26.33333  79
# 3:     Spinach      FR 40 NA NA 49  40   49 44.50000  89
# 4:        Kale      CA 54  5 16 NA   5   54 25.00000  75
# 5:      Carrot      CA NA NA NA NA Inf -Inf      NaN   0

2nd Approach: based on @ExperimenteR idea, using matrixStats package

library(matrixStats)
dt1 <- dt[,`:=`(MIN = rowMins(as.matrix(dt[, Q1:Q4]), na.rm=TRUE),
                MAX = rowMaxs(as.matrix(dt[, Q1:Q4]), na.rm = TRUE),
                AVG = rowMeans(dt[, Q1:Q4], na.rm = TRUE),
                SUM = rowSums(dt[, Q1:Q4], na.rm = TRUE))][]
# ProductName Country Q1 Q2 Q3 Q4 MIN  MAX      AVG SUM
# 1:     Lettuce      CA NA 22 51 79  22   79 50.66667 152
# 2:    Beetroot      FR 61  8 NA 10   8   61 26.33333  79
# 3:     Spinach      FR 40 NA NA 49  40   49 44.50000  89
# 4:        Kale      CA 54  5 16 NA   5   54 25.00000  75
# 5:      Carrot      CA NA NA NA NA Inf -Inf      NaN   0
Mantle answered 26/4, 2020 at 8:27 Comment(0)

I don't know how efficient this is, but I managed to do it by grouping by a column with unique values (e.g. IDs), in this case by = ProductName:

dt[, .(Country, Q1, Q2, Q3, Q4,
   AVG = mean(c(Q1, Q2, Q3, Q4), na.rm=TRUE),
   MIN = min (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
   MAX = max (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
   SUM = sum (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
   NAcnt= sum(is.na(c(Q1, Q2, Q3, Q4)))), by = ProductName]

# ProductName Country Q1 Q2 Q3 Q4      AVG MIN  MAX SUM NAcnt
# 1:     Lettuce      CA NA 22 51 79 50.66667  22   79 152     1
# 2:    Beetroot      FR 61  8 NA 10 26.33333   8   61  79     1
# 3:     Spinach      FR 40 NA NA 49 44.50000  40   49  89     2
# 4:        Kale      CA 54  5 16 NA 25.00000   5   54  75     1
# 5:      Carrot      CA NA NA NA NA      NaN Inf -Inf   0     4
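
A caveat worth illustrating: grouping by `ProductName` is row-wise only while that column is unique. If a product appears twice, its rows are pooled into one group. The sketch below duplicates one row to show the difference and groups by an explicit row index instead; `rid` is a hypothetical helper column, not something from the question.

```r
library(data.table)

dt <- data.table(ProductName = c("Lettuce", "Beetroot", "Spinach", "Kale", "Carrot"),
                 Country = c("CA", "FR", "FR", "CA", "CA"),
                 Q1 = c(NA, 61, 40, 54, NA), Q2 = c(22,  8, NA,  5, NA),
                 Q3 = c(51, NA, NA, 16, NA), Q4 = c(79, 10, 49, NA, NA))

dup <- rbind(dt, dt[1])   # Lettuce now appears twice

# Grouping by ProductName pools the two Lettuce rows into one result.
pooled <- dup[, .(SUM = sum(c(Q1, Q2, Q3, Q4), na.rm = TRUE)), by = ProductName]

# Grouping by a row index keeps one result per row.
dup[, rid := .I]
per_row <- dup[, .(ProductName, SUM = sum(c(Q1, Q2, Q3, Q4), na.rm = TRUE)), by = rid]

print(pooled)
print(per_row)
```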
Gauleiter answered 3/4 at 8:11 Comment(0)
