Replace NA with 0, only in numeric columns in data.table

I have a data.table with columns of different data types. My goal is to select only the numeric columns and replace the NA values within those columns with 0. I am aware that replacing NA values with zero across a whole table goes like this:

DT[is.na(DT)] <- 0

To select only numeric columns, I found this solution, which works fine:

DT[, as.numeric(which(sapply(DT,is.numeric))), with = FALSE]

I can achieve what I want by assigning

DT2 <- DT[, as.numeric(which(sapply(DT,is.numeric))), with = FALSE]

and then do:

DT2[is.na(DT2)] <- 0

But of course I would like to have my original DT modified by reference. With the following, however:

DT[, as.numeric(which(sapply(DT, is.numeric))), with = FALSE][
  is.na(DT[, as.numeric(which(sapply(DT, is.numeric))), with = FALSE])] <- 0

I get

"Error in [.data.table([...] i is invalid type (matrix)"

What am I missing? Any help is much appreciated!!

Dempsey asked 23/5, 2016 at 12:53
You are missing the basic syntax of data.table, which doesn't do DT[...] <- y. Try reading the vignettes: github.com/Rdatatable/data.table/wiki/Getting-started. It's a more efficient way to learn than "finding solutions" for each step you think you need to take. The answer below doesn't even require the with=FALSE trick you found. – Polycarp
Thanks for the advice. Could you please elaborate on the basic syntax error "...which don't do DT[...] <- y"? What does that mean? Why does the assignment work in one case and not in the other? I could not find anything in the vignettes; it would still help me a lot to understand. – Dempsey
Data tables shouldn't be used like DT[...] <- y, where ... is whatever you have in mind. Assignment is done with := or set(), not with <-. The arrow way actually does work in special cases, in the sense that the table is modified, but it does not work by reference (last I checked) and so is not idiomatic. To work with data.tables, you'll have to learn some of their idioms (see the sketch after these comments). If you don't already know what I mean by :=, that's a good reason to check out the vignettes. – Polycarp
a) It will be much more efficient to compute the column list numeric_cols <- which(sapply(DT, is.numeric)) once at the top, instead of inside each j-expression, for each group. b) Then just reference DT[, numeric_cols]. c) Yes, putting a function call inside the j-expression is tricky and often triggers syntax errors. – Pointer
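
To illustrate the := idiom mentioned in these comments, here is a minimal sketch; the toy table and column names are made up for illustration:

library(data.table)
dt <- data.table(a = c(NA, 1, 2), b = c("x", NA, "y"))
dt[is.na(a), a := 0]   # sub-assignment by reference: no copy of dt is made
dt
#>    a    b
#> 1: 0    x
#> 2: 1 <NA>
#> 3: 2    y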

We can use set

for(j in seq_along(DT)){
  # only touch rows where the value is NA and the column itself is numeric
  set(DT, i = which(is.na(DT[[j]]) & is.numeric(DT[[j]])), j = j, value = 0)
}

Or create an index of the numeric columns, loop through it, and set the NA values to 0:

ind <- which(sapply(DT, is.numeric))
for(j in ind){
  set(DT, i = which(is.na(DT[[j]])), j = j, value = 0)
}

data

set.seed(24)
DT <- data.table(v1= c(NA, 1:4), v2 = c(NA, LETTERS[1:4]), v3=c(rnorm(4), NA))
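
After running either loop, the character column is untouched and the NAs in v1 and v3 become 0 (values under set.seed(24), matching the printout in the next answer):

DT
#>    v1   v2         v3
#> 1:  0 <NA> -0.5458808
#> 2:  1    A  0.5365853
#> 3:  2    B  0.4196231
#> 4:  3    C -0.5836272
#> 5:  4    D  0.0000000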
Curbing answered 23/5, 2016 at 13:02
What does set(..., j = j, ...) mean? All columns? Surely we only need to do set() on the subset of columns that are numeric, as the OP asked? – Pointer
@Pointer Not all columns. In the code I compute ind, which holds the column indices of the numeric columns, so it only loops through those columns. – Curbing
OK. Why can't you avoid looping, by using ind to index into names(DT) to get a list of column names and pass that as the j argument of set()? I guess the expression to find NAs would then need to be 2D. Well, I guess set() is already fairly fast. – Pointer
@Pointer Not sure I understand your question. The j can take either column names or the column index. Here, ind is the index. – Curbing
Why can't you avoid the loop for(j in ind) { ... set(..., j = j, ...) }? Can't you directly do set(DT, j = ind) in general? I think you could, but the only reason for the j-loop is that the i-expression to find NA rows for that specific j changes. (A sketch of this point follows.) – Pointer
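
To make the last comment concrete: set() does accept a vector of columns in j, but a single i applies to all of them, so one call only works when the same rows should be overwritten in every column. A sketch, assuming the example DT and ind from this answer (left commented out because it would overwrite non-NA values):

# i is shared across every column in j, so this would zero the rows
# selected by v1's NAs in all numeric columns, not just in v1:
# set(DT, i = which(is.na(DT[[ind[1]]])), j = ind, value = 0)
# Each numeric column has its own NA positions, hence the per-column loop.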

I wanted to explore and possibly improve on the excellent answer given above by @akrun. Here's the data he used in his example:

library(data.table)

set.seed(24)
DT <- data.table(v1= c(NA, 1:4), v2 = c(NA, LETTERS[1:4]), v3=c(rnorm(4), NA))
DT

#>    v1   v2         v3
#> 1: NA <NA> -0.5458808
#> 2:  1    A  0.5365853
#> 3:  2    B  0.4196231
#> 4:  3    C -0.5836272
#> 5:  4    D         NA

And the two methods he suggested:

fun1 <- function(x){
  for(j in seq_along(x)){
    set(x, i = which(is.na(x[[j]]) & is.numeric(x[[j]])), j = j, value = 0)
  }
}

fun2 <- function(x){
  ind <- which(sapply(x, is.numeric))
  for(j in ind){
    set(x, i = which(is.na(x[[j]])), j = j, value = 0)
  }
}

I think the first method above is really ingenious, as it exploits the fact that NAs are typed (see the illustration below).
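
A quick illustration of what "NAs are typed" means here (plain base R behaviour):

typeof(NA)          # "logical" -- the bare NA
typeof(NA_real_)    # "double"  -- a typed NA
is.numeric(c(NA, 1:4))           # TRUE:  NA is promoted to integer
is.numeric(c(NA, LETTERS[1:4]))  # FALSE: NA is promoted to character

So inside fun1, is.numeric(x[[j]]) is FALSE for non-numeric columns, the & turns every element to FALSE, and which() returns an empty index, meaning set() touches nothing there.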

First of all, even though .SD is not available in the i argument, it is possible to pull the column by name with get(), so I thought I could sub-assign the data.table this way:

fun3 <- function(x){
  nms <- names(x)[sapply(x, is.numeric)]
  for(j in nms){
    x[is.na(get(j)), (j):=0]
  }
}

The generic case, of course, would be to rely on .SD and .SDcols to work only on the numeric columns:

fun4 <- function(x){
  nms <- names(x)[sapply(x, is.numeric)]
  x[, (nms):=lapply(.SD, function(i) replace(i, is.na(i), 0)), .SDcols=nms]  
}

But then I thought to myself: "Hey, who says we can't go all the way to base R for this sort of operation?" Here's a simple lapply() with a conditional statement, wrapped in setDT():

fun5 <- function(x){
  # note: this builds a new data.table from the list that lapply() returns,
  # rather than modifying x by reference
  setDT(
    lapply(x, function(i){
      if(is.numeric(i))
        i[is.na(i)] <- 0
      i
    })
  )
}

Finally, we could use the same conditional idea to limit the columns on which we apply set():

fun6 <- function(x){
  for(j in seq_along(x)){
    if(is.numeric(x[[j]]))
      set(x, i = which(is.na(x[[j]])), j = j, value = 0)
  }
}

Here are the benchmarks:

microbenchmark::microbenchmark(
  for.set.2cond = fun1(copy(DT)),
  for.set.ind   = fun2(copy(DT)),
  for.get       = fun3(copy(DT)),
  for.SDcol     = fun4(copy(DT)),
  for.list      = fun5(copy(DT)),
  for.set.if    = fun6(copy(DT))
)

#> Unit: microseconds
#>           expr     min      lq     mean   median       uq      max neval cld
#>  for.set.2cond  59.812  67.599 131.6392  75.5620 114.6690 4561.597   100 a  
#>    for.set.ind  71.492  79.985 142.2814  87.0640 130.0650 4410.476   100 a  
#>        for.get 553.522 569.979 732.6097 581.3045 789.9365 7157.202   100   c
#>      for.SDcol 376.919 391.784 527.5202 398.3310 629.9675 5935.491   100  b 
#>       for.list  69.722  81.932 137.2275  87.7720 123.6935 3906.149   100 a  
#>     for.set.if  52.380  58.397 116.1909  65.1215  72.5535 4570.445   100 a  
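
As a postscript not covered by these benchmarks: data.table versions from roughly 1.12.4 onward also ship setnafill(), which replaces NAs by reference and accepts only numeric columns, so on a recent installation the whole task reduces to the following (worth checking against your installed version):

ind <- which(sapply(DT, is.numeric))
setnafill(DT, type = "const", fill = 0, cols = ind)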
Warp answered 4/1, 2019 at 15:12

You can use the tidyverse purrr function map_if along with ifelse to do the job in a single line of code.

library(data.table)
library(tidyverse)
set.seed(24)
DT <- data.table(v1 = sample(c(1:3, NA), 20, replace = TRUE),
                 v2 = sample(c(LETTERS[1:3], NA), 20, replace = TRUE),
                 v3 = sample(c(1:3, NA), 20, replace = TRUE))

The single line of code below takes a DT with numeric and non-numeric columns and operates only on the numeric columns, replacing the NAs with 0:

DT %>% map_if(is.numeric, ~ ifelse(is.na(.x), 0, .x)) %>% as.data.table
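
One caveat worth adding (standard R semantics rather than anything specific to this answer): the pipeline builds a new object instead of modifying DT by reference, so assign the result back if you want DT itself updated:

DT <- DT %>% map_if(is.numeric, ~ ifelse(is.na(.x), 0, .x)) %>% as.data.table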

So, tidyverse can be less verbose than data.table sometimes :-)

Ibson answered 31/8, 2020 at 3:17
May I ask why my answer was downvoted? Did it not work? – Ibson
Thanks, helped me plenty! – Parry
