data.table and pmin with na.rm=TRUE argument

I am trying to calculate the minimum across rows using the pmin function and data.table (similar to the post "row-by-row operations and updates in data.table"), but with the columns given as a character vector of names, using something like the with=FALSE syntax, and with the na.rm=TRUE argument.

DT <- data.table(x = c(1,1,2,3,4,1,9), 
                 y = c(2,4,1,2,5,6,6),
                 z = c(3,5,1,7,4,5,3),
                 a = c(1,3,NA,3,5,NA,2))

> DT
   x y z  a
1: 1 2 3  1
2: 1 4 5  3
3: 2 1 1 NA
4: 3 2 7  3
5: 4 5 4  5
6: 1 6 5 NA
7: 9 6 3  2

I can calculate the minimum across rows using columns directly:

DT[,min_val := pmin(x,y,z,a,na.rm=TRUE)]

giving

> DT
   x y z  a min_val
1: 1 2 3  1       1
2: 1 4 5  3       1
3: 2 1 1 NA       1
4: 3 2 7  3       2
5: 4 5 4  5       4
6: 1 6 5 NA       1
7: 9 6 3  2       2

However, I need to do this over a large, automatically generated set of columns, so I want it to work for an arbitrary set of column names stored in a col_names variable, e.g. col_names <- c("a","y","z")

I can do this:

DT[, col_min := do.call(pmin,DT[,col_names,with=FALSE])]

But it gives me NA values. I can't figure out how to pass the na.rm=TRUE argument into the do.call. I've tried defining the function as

DT[, col_min := do.call(function(x) pmin(x,na.rm=TRUE),DT[,col_names,with=FALSE])]

but this gives me an error. I also tried passing in the argument as an additional element in a list, but I think pmin (or do.call) gets confused between the DT non-standard evaluation of column names and the argument.

Any ideas?

Guffey answered 3/3, 2016 at 17:28

If we need the minimum value of each row across the whole dataset, call pmin through do.call on .SD, appending na.rm=TRUE to .SD as a named list element so that it is passed on to pmin as an argument.

DT[, col_min:= do.call(pmin, c(.SD, list(na.rm=TRUE)))]
DT
#   x y z  a col_min
#1: 1 2 3  1       1
#2: 1 4 5  3       1
#3: 2 1 1 NA       1
#4: 3 2 7  3       2
#5: 4 5 4  5       4
#6: 1 6 5 NA       1
#7: 9 6 3  2       2
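
To see why appending the list works, here is a minimal base R sketch (no data.table needed). do.call treats each list element as one argument to the function, and named elements are matched by name, so na.rm = TRUE reaches pmin as its na.rm argument. The vectors and the names cols and call_args are only illustrative stand-ins for the columns a, y and z.

```r
# Plain vectors standing in for columns a, y and z of DT
a <- c(1, 3, NA, 3, 5, NA, 2)
y <- c(2, 4, 1, 2, 5, 6, 6)
z <- c(3, 5, 1, 7, 4, 5, 3)

cols <- list(a = a, y = y, z = z)

# Without na.rm, an NA anywhere in a row propagates to the result
without_narm <- do.call(pmin, cols)

# c() on two lists gives one longer list; the element named na.rm
# is matched to pmin's na.rm argument by do.call
call_args <- c(cols, list(na.rm = TRUE))
with_narm <- do.call(pmin, call_args)

# The same call written out by hand
direct <- pmin(a, y, z, na.rm = TRUE)

print(without_narm)
print(with_narm)
```

Here with_narm and direct should be identical, while without_narm has NA in rows 3 and 6.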

If we want to do this only for a subset of columns whose names are stored in 'col_names', use .SDcols.

DT[, col_min:= do.call(pmin, c(.SD, list(na.rm=TRUE))), 
                .SDcols= col_names]
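
For completeness, here is a self-contained sketch of the .SDcols approach using the data from the question. It also shows one possible alternative: a small variadic wrapper (pmin_narm is a hypothetical helper name, not part of data.table or base R). The function(x) attempt in the question most likely failed because do.call passes every column as a separate argument, so a one-argument function receives unused arguments; a function taking ... accepts them all.

```r
library(data.table)

DT <- data.table(x = c(1, 1, 2, 3, 4, 1, 9),
                 y = c(2, 4, 1, 2, 5, 6, 6),
                 z = c(3, 5, 1, 7, 4, 5, 3),
                 a = c(1, 3, NA, 3, 5, NA, 2))

col_names <- c("a", "y", "z")

# Approach from the answer: append na.rm = TRUE to the list of columns
DT[, col_min := do.call(pmin, c(.SD, list(na.rm = TRUE))),
   .SDcols = col_names]

# Alternative sketch: a variadic wrapper (hypothetical helper) that
# forwards all columns to pmin and fixes na.rm = TRUE
pmin_narm <- function(...) pmin(..., na.rm = TRUE)
DT[, col_min2 := do.call(pmin_narm, .SD), .SDcols = col_names]

print(DT)
```

Both columns should agree. Rows 3 and 6, where a is NA, take the minimum of y and z only.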
Parallelism answered 3/3, 2016 at 17:31
