Efficient row-wise operations on a data.table

I need to find the row-wise minimum of many (60+) relatively large data.frames (~250,000 x 3) (or, equivalently, I can work on an xts).

set.seed(1000)
my.df <- sample(1:5, 250000*3, replace=TRUE)
dim(my.df) <- c(250000,3)
my.df <- as.data.frame(my.df)
names(my.df) <- c("A", "B", "C")

The data frame my.df looks like this:

> head(my.df)

  A B C
1 2 5 2
2 4 5 5
3 1 5 3
4 4 4 3
5 3 5 5
6 1 5 3

I tried:

require(data.table)
my.dt <- as.data.table(my.df)

my.dt[, row.min:=0]  # without this: "Attempt to add new column(s) and set subset of rows at the same time"
system.time(
  for (i in 1:dim(my.dt)[1]) my.dt[i, row.min:= min(A, B, C)]
)

On my system this takes ~400 seconds. It works, but I am not confident it is the best way to use data.table. Am I using data.table correctly? Is there a more efficient way to do simple row-wise operations?

Hilton answered 25/10, 2011 at 5:44 Comment(0)

Or, just pmin.

my.dt <- as.data.table(my.df)
system.time(my.dt[,row.min:=pmin(A,B,C)])
# user  system elapsed 
# 0.02    0.00    0.01 
head(my.dt)
#      A B C row.min
# [1,] 2 5 2       2
# [2,] 4 5 5       4
# [3,] 1 5 3       1
# [4,] 4 4 3       3
# [5,] 3 5 5       3
# [6,] 1 5 3       1
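
pmin is the element-wise (vectorized) counterpart of min: given several vectors, it returns the minimum at each position, so the whole comparison runs in one call over the columns instead of one call per row. A quick illustration using the values from the first three rows above:

pmin(c(2, 4, 1), c(5, 5, 5), c(2, 5, 3))
# [1] 2 4 1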
Toitoiboid answered 25/10, 2011 at 9:7 Comment(7)
Sorry if this is hijacking, but what if I am creating the list of columns dynamically, e.g. any column that is a POSIXct type, or whose name matches an expression, giving me a variable such as x=c("A","B","C") ... how do I use "x" in the pmin expression? It uses the literal values of x rather than resolving them to the values in the columns.Glorious
@Glorious That would make a good question. Wild untested stab: DT[,do.call(pmin,.SD),.SDcols=x] or perhaps do.call(pmin,lapply(x,get)). The first way using .SDcols should be more efficient. You don't have to group to use .SD. You can add a by clause as well if you need to, without changing j.Toitoiboid
Thank you Matthew. The first worked, and I needed to add na.rm=TRUE to it, which took a little trial-and-error, but I ended up with DT <- DT[,NEWCOLNAME:=do.call(pmin, c(.SD, na.rm=TRUE)), .SDcols=x]. I was wondering where .SDcols was documented and found it in the NEWS for release 1.6.3, but not in any of the help() topics.Glorious
@Glorious Great. I just checked ?data.table and .SDcols is documented there (search the help page for "SDcols").Toitoiboid
Apologies, of all the obvious places to look, that was the one I didn't check. I was trying things like ??.SDcols and even just ??SD and not getting any relevant hits.Glorious
@Glorious No problem at all. I'm not sure what ?? does, but if there is a way to search all a package's documentation (all help files & vignettes) for a particular string, then I'd like to know, too!Toitoiboid
help("??") directs to help.search() which will search documentation "matching a given character string in the (file) name, alias, title, concept or keyword entries"... i.e. not the entire help text. Presumably .SDcols isn't in the concept or keywords, so it doesn't show up. The rules for searching a vignette are: "The 'name' and 'alias' are both the base of the vignette filename, and the 'concept' entries are taken from the \VignetteKeywords entries. Vignettes are not classified using the help system "keyword" classifications."Glorious

After some discussion around row-wise first/last occurrences from column series in data.table, which suggested that melting first would be faster than a row-wise calculation, I decided to benchmark:

  • pmin (Matt Dowle's answer above), below as tm1
  • apply (Andrie's answer above), below as tm2
  • melting first, then min by group, below as tm3

so:

library(microbenchmark); library(data.table)
set.seed(1000)
b <- data.table(m=integer(), n=integer(), tm1 = numeric(), tm2 = numeric(), tm3 = numeric())

for (m in c(2.5,100)*1e5){

  for (n in c(3,50)){
    my.df <- sample(1:5, m*n, replace=TRUE)
    dim(my.df) <- c(m,n)    
    my.df <- as.data.frame(my.df)
    names(my.df) <- c(LETTERS,letters)[1:n]   
    my.dt <- as.data.table(my.df)

    tm1 <- mean(microbenchmark(my.dt[, foo := do.call(pmin, .SD)], times=30L)$time)/1e6
    my.dt <- as.data.table(my.df)
    tm2 <- mean(microbenchmark(apply(my.dt, 1, min), times=30L)$time)/1e6
    my.dt <- as.data.table(my.df)
    tm3 <- mean(microbenchmark(
                melt(my.dt[, id:=1:nrow(my.dt)], id.vars='id')[, min(value), by=id], 
                times=30L
               )$time)/1e6
    b <- rbind(b, data.table(m, n, tm1, tm2, tm3) ) 
  }
}

(I ran out of time to try more combinations.) This gives us:

b
#          m  n        tm1       tm2         tm3
# 1: 2.5e+05  3   16.20598  1000.345    39.36171
# 2: 2.5e+05 50  166.60470  1452.239   588.49519
# 3: 1.0e+07  3  662.60692 31122.386  1668.83134
# 4: 1.0e+07 50 6594.63368 50915.079 17098.96169
c <- melt(b, id.vars=c('m','n'))

library(ggplot2)
ggplot(c, aes(x=m, linetype=as.factor(n), col=variable, y=value)) + geom_line() +
  ylab('Runtime (millisec)') + xlab('# of rows') +  
  guides(linetype=guide_legend(title='Number of columns'))

[Plot: runtime (milliseconds) versus number of rows, coloured by method (tm1, tm2, tm3), with line type indicating the number of columns]

Although I knew apply (tm2) would scale poorly, I am surprised that pmin (tm1) scales so well given that R is not really designed for row-wise operations. I couldn't identify a case where melt-min-by-group (tm3) should be preferred over pmin.

Summation answered 1/12, 2015 at 11:42 Comment(3)
You don't need to create id here, you could just do melt(my.dt, measure.vars = names(my.dt))Neopythagoreanism
@DavidArenburg melt(my.dt, measure.vars = names(my.dt)) returns a data.table with two columns, variable (A,B,C) and value. Without id I don't know how to group by the original row number (of the unmelted data.table).Summation
3 years later, this is still the best answer to this question in general. I wish I could upvote multiple times.Disembroil

The classical way of doing row-wise operations in R is to use apply:

my.df$min <- apply(my.df, 1, min)
head(my.df)
  A B C min
1 2 5 4   2
2 4 3 1   1
3 1 1 5   1
4 4 1 5   1
5 3 3 4   3
6 1 1 1   1

On my machine, this operation takes about 0.25 seconds.

Hagbut answered 25/10, 2011 at 5:58 Comment(2)
Filed under: "try the easy stuff first".Hilton
This converts my.df implicitly to a matrix... generally not a good idea.Tasman
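
To illustrate what that comment warns about, a minimal sketch with made-up data (not from the question): as soon as any non-numeric column is present, apply() coerces the whole data frame to a character matrix, and min() then compares strings rather than numbers.

df <- data.frame(A = c(2, 10), B = c(5, 3))
apply(df, 1, min)    # all-numeric: coerced to a numeric matrix, returns 2 3

df$label <- c("x", "y")
apply(df, 1, min)    # now coerced to a character matrix;
                     # min() compares the values lexicographically, not numerically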
