Efficient row-wise operations on a data.table

I need to find the row-wise minimum of many (60+) relatively large data.frames (~250,000 x 3) (or, equivalently, I can work on an xts).

set.seed(1000)
my.df <- sample(1:5, 250000*3, replace=TRUE)
dim(my.df) <- c(250000,3)
my.df <- as.data.frame(my.df)
names(my.df) <- c("A", "B", "C")

The data frame my.df looks like this:

> head(my.df)

  A B C
1 2 5 2
2 4 5 5
3 1 5 3
4 4 4 3
5 3 5 5
6 1 5 3

I tried:

require(data.table)
my.dt <- as.data.table(my.df)

my.dt[, row.min:=0]  # without this: "Attempt to add new column(s) and set subset of rows at the same time"
system.time(
  for (i in 1:dim(my.dt)[1]) my.dt[i, row.min:= min(A, B, C)]
)

On my system this takes ~400 seconds. It works, but I am not confident it is the best way to use data.table. Am I using data.table correctly? Is there a more efficient way to do simple row-wise operations?

Hilton answered 25/10, 2011 at 5:44 Comment(0)

Or, just pmin.

my.dt <- as.data.table(my.df)
system.time(my.dt[,row.min:=pmin(A,B,C)])
# user  system elapsed 
# 0.02    0.00    0.01 
head(my.dt)
#      A B C row.min
# [1,] 2 5 2       2
# [2,] 4 5 5       4
# [3,] 1 5 3       1
# [4,] 4 4 3       3
# [5,] 3 5 5       3
# [6,] 1 5 3       1
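
pmin is the element-wise (vectorized) counterpart of min: given several vectors, it returns the minimum at each position, so the whole comparison runs in one call over the columns instead of one call per row. A quick illustration using the values from the first three rows above:

pmin(c(2, 4, 1), c(5, 5, 5), c(2, 5, 3))
# [1] 2 4 1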
Toitoiboid answered 25/10, 2011 at 9:7 Comment(7)
Sorry if this is hijacking, but what if I am creating the list of columns dynamically, e.g. any column that is a POSIXct type, or whose name matches an expression, giving me a variable such as x=c("A","B","C") ... how do I use "x" in the pmin expression? It uses the literal values of x rather than resolving them to the values in the columns.Glorious
@Glorious That would make a good question. Wild untested stab: DT[,do.call(pmin,.SD),.SDcols=x] or perhaps do.call(pmin,lapply(x,get)). The first way using .SDcols should be more efficient. You don't have to group to use .SD. You can add a by clause as well if you need to, without changing j.Toitoiboid
Thank you Matthew. The first worked, and I needed to add na.rm=TRUE to it, which took a little trial-and-error, but I ended up with DT <- DT[,NEWCOLNAME:=do.call(pmin, c(.SD, na.rm=TRUE)), .SDcols=x]. I was wondering where .SDcols was documented and found it in the NEWS for release 1.6.3, but not in any of the help() topics.Glorious
@Glorious Great. I just checked ?data.table and .SDcols is documented there (search the help page for "SDcols").Toitoiboid
Apologies, of all the obvious places to look, that was the one I didn't check. I was trying things like ??.SDcols and even just ??SD and not getting any relevant hits.Glorious
@Glorious No problem at all. I'm not sure what ?? does, but if there is a way to search all a package's documentation (all help files & vignettes) for a particular string, then I'd like to know, too!Toitoiboid
help("??") directs to help.search() which will search documentation "matching a given character string in the (file) name, alias, title, concept or keyword entries"... i.e. not the entire help text. Presumably .SDcols isn't in the concept or keywords, so it doesn't show up. The rules for searching a vignette are: "The 'name' and 'alias' are both the base of the vignette filename, and the 'concept' entries are taken from the \VignetteKeywords entries. Vignettes are not classified using the help system "keyword" classifications."Glorious

After some discussion around row-wise first/last occurrences from column series in data.table, which suggested that melting first would be faster than a row-wise calculation, I decided to benchmark:

  • pmin (Matt Dowle's answer above), below as tm1
  • apply (Andrie's answer above), below as tm2
  • melting first, then min by group, below as tm3

so:

library(microbenchmark); library(data.table)
set.seed(1000)
b <- data.table(m=integer(), n=integer(), tm1 = numeric(), tm2 = numeric(), tm3 = numeric())

for (m in c(2.5,100)*1e5){

  for (n in c(3,50)){
    my.df <- sample(1:5, m*n, replace=TRUE)
    dim(my.df) <- c(m,n)    
    my.df <- as.data.frame(my.df)
    names(my.df) <- c(LETTERS,letters)[1:n]   
    my.dt <- as.data.table(my.df)

    tm1 <- mean(microbenchmark(my.dt[, foo := do.call(pmin, .SD)], times=30L)$time)/1e6
    my.dt <- as.data.table(my.df)
    tm2 <- mean(microbenchmark(apply(my.dt, 1, min), times=30L)$time)/1e6
    my.dt <- as.data.table(my.df)
    tm3 <- mean(microbenchmark(
                melt(my.dt[, id:=1:nrow(my.dt)], id.vars='id')[, min(value), by=id], 
                times=30L
               )$time)/1e6
    b <- rbind(b, data.table(m, n, tm1, tm2, tm3) ) 
  }
}

(I ran out of time to try more combinations.) This gives us:

b
#          m  n        tm1       tm2         tm3
# 1: 2.5e+05  3   16.20598  1000.345    39.36171
# 2: 2.5e+05 50  166.60470  1452.239   588.49519
# 3: 1.0e+07  3  662.60692 31122.386  1668.83134
# 4: 1.0e+07 50 6594.63368 50915.079 17098.96169
c <- melt(b, id.vars=c('m','n'))

library(ggplot2)
ggplot(c, aes(x=m, linetype=as.factor(n), col=variable, y=value)) + geom_line() +
  ylab('Runtime (millisec)') + xlab('# of rows') +  
  guides(linetype=guide_legend(title='Number of columns'))

[Plot: runtime (milliseconds) versus number of rows, coloured by method (tm1, tm2, tm3), with line type indicating the number of columns]

Although I knew apply (tm2) would scale poorly, I am surprised that pmin (tm1) scales so well given that R is not really designed for row-wise operations. I couldn't identify a case where melt-min-by-group (tm3) should be preferred over pmin.

Summation answered 1/12, 2015 at 11:42 Comment(3)
You don't need to create id here, you could just do melt(my.dt, measure.vars = names(my.dt))Neopythagoreanism
@DavidArenburg melt(my.dt, measure.vars = names(my.dt)) returns a data.table with two columns, variable (A,B,C) and value. Without id I don't know how to group by the original row number (of the unmelted data.table).Summation
3 years later, this is still the best answer to this question in general. I wish I could upvote multiple times.Disembroil

The classical way of doing row-wise operations in R is to use apply:

my.df$min <- apply(my.df, 1, min)
head(my.df)
  A B C min
1 2 5 4   2
2 4 3 1   1
3 1 1 5   1
4 4 1 5   1
5 3 3 4   3
6 1 1 1   1

On my machine, this operation takes about 0.25 seconds.

Hagbut answered 25/10, 2011 at 5:58 Comment(2)
Filed under: "try the easy stuff first".Hilton
This converts my.df implicitly to a matrix... generally not a good idea.Tasman
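
To illustrate what that comment warns about, a minimal sketch with made-up data (not from the question): as soon as any non-numeric column is present, apply() coerces the whole data frame to a character matrix, and min() then compares strings rather than numbers.

df <- data.frame(A = c(2, 10), B = c(5, 3))
apply(df, 1, min)    # all-numeric: coerced to a numeric matrix, returns 2 3

df$label <- c("x", "y")
apply(df, 1, min)    # now coerced to a character matrix;
                     # min() compares the values lexicographically, not numerically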
