Replace NA with 0, only in numeric columns in data.table

I have a data.table with columns of different data types. My goal is to select only the numeric columns and replace the NA values within those columns with 0. I am aware that replacing NA values with zero across a whole table goes like this:

DT[is.na(DT)] <- 0

To select only numeric columns, I found this solution, which works fine:

DT[, as.numeric(which(sapply(DT,is.numeric))), with = FALSE]

I can achieve what I want by assigning

DT2 <- DT[, as.numeric(which(sapply(DT,is.numeric))), with = FALSE]

and then do:

DT2[is.na(DT2)] <- 0

But of course I would like to have my original DT modified by reference. With the following, however:

DT[, as.numeric(which(sapply(DT, is.numeric))), with = FALSE][
  is.na(DT[, as.numeric(which(sapply(DT, is.numeric))), with = FALSE])] <- 0

I get

"Error in [.data.table([...] i is invalid type (matrix)"

What am I missing? Any help is much appreciated!!

Dempsey asked 23/5, 2016 at 12:53
You are missing the basic syntax of data.table, which doesn't do DT[...] <- y. Try reading the vignettes: github.com/Rdatatable/data.table/wiki/Getting-started. It's a more efficient way to learn than "finding solutions" for each step you think you need to take. The answer below doesn't even require the with=FALSE trick you found. – Polycarp
Thanks for the advice. Could you please elaborate on the basic syntax error "...which don't do DT[...] <- y"? What does that mean? Why does the assignment work in one case and not in the other? I could not find anything in the vignettes; it would still help me a lot to understand. – Dempsey
Data tables shouldn't be used like DT[...] <- y, where ... is whatever you have in mind. Assignment is done with := or set(), not with <-. The arrow way actually does work in special cases, in the sense that the table is modified, but it does not work by reference (last I checked) and so is not idiomatic. To work with data.tables, you'll have to learn some of their idioms (see the sketch after these comments). If you don't already know what I mean by :=, that's a good reason to check out the vignettes. – Polycarp
a) It will be much more efficient to compute the column list numeric_cols <- which(sapply(DT, is.numeric)) once at the top, instead of inside each j-expression, for each group. b) Then just reference DT[, numeric_cols]. c) Yes, putting a function call inside the j-expression is tricky and often triggers syntax errors. – Pointer
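
To illustrate the := idiom mentioned in these comments, here is a minimal sketch; the toy table and column names are made up for illustration:

library(data.table)
dt <- data.table(a = c(NA, 1, 2), b = c("x", NA, "y"))
dt[is.na(a), a := 0]   # sub-assignment by reference: no copy of dt is made
dt
#>    a    b
#> 1: 0    x
#> 2: 1 <NA>
#> 3: 2    y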

We can use set

for(j in seq_along(DT)){
  # only touch rows where the value is NA and the column itself is numeric
  set(DT, i = which(is.na(DT[[j]]) & is.numeric(DT[[j]])), j = j, value = 0)
}

Or create an index of the numeric columns, loop through it, and set the NA values to 0:

ind <- which(sapply(DT, is.numeric))
for(j in ind){
  set(DT, i = which(is.na(DT[[j]])), j = j, value = 0)
}

data

set.seed(24)
DT <- data.table(v1= c(NA, 1:4), v2 = c(NA, LETTERS[1:4]), v3=c(rnorm(4), NA))
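
After running either loop, the character column is untouched and the NAs in v1 and v3 become 0 (values under set.seed(24), matching the printout in the next answer):

DT
#>    v1   v2         v3
#> 1:  0 <NA> -0.5458808
#> 2:  1    A  0.5365853
#> 3:  2    B  0.4196231
#> 4:  3    C -0.5836272
#> 5:  4    D  0.0000000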
Curbing answered 23/5, 2016 at 13:02
What does set(..., j = j, ...) mean? All columns? Surely we only need to do set() on the subset of columns that are numeric, as the OP asked? – Pointer
@Pointer Not all columns. In the code I compute ind, which holds the column indices of the numeric columns, so it only loops through those columns. – Curbing
OK. Why can't you avoid looping, by using ind to index into names(DT) to get a list of column names and pass that as the j argument of set()? I guess the expression to find NAs would then need to be 2D. Well, I guess set() is already fairly fast. – Pointer
@Pointer Not sure I understand your question. The j can take either column names or the column index. Here, ind is the index. – Curbing
Why can't you avoid the loop for(j in ind) { ... set(..., j = j, ...) }? Can't you directly do set(DT, j = ind) in general? I think you could, but the only reason for the j-loop is that the i-expression to find NA rows for that specific j changes. (A sketch of this point follows.) – Pointer
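
To make the last comment concrete: set() does accept a vector of columns in j, but a single i applies to all of them, so one call only works when the same rows should be overwritten in every column. A sketch, assuming the example DT and ind from this answer (left commented out because it would overwrite non-NA values):

# i is shared across every column in j, so this would zero the rows
# selected by v1's NAs in all numeric columns, not just in v1:
# set(DT, i = which(is.na(DT[[ind[1]]])), j = ind, value = 0)
# Each numeric column has its own NA positions, hence the per-column loop.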

I wanted to explore and possibly improve on the excellent answer given above by @akrun. Here's the data he used in his example:

library(data.table)

set.seed(24)
DT <- data.table(v1= c(NA, 1:4), v2 = c(NA, LETTERS[1:4]), v3=c(rnorm(4), NA))
DT

#>    v1   v2         v3
#> 1: NA <NA> -0.5458808
#> 2:  1    A  0.5365853
#> 3:  2    B  0.4196231
#> 4:  3    C -0.5836272
#> 5:  4    D         NA

And the two methods he suggested:

fun1 <- function(x){
  for(j in seq_along(x)){
    set(x, i = which(is.na(x[[j]]) & is.numeric(x[[j]])), j = j, value = 0)
  }
}

fun2 <- function(x){
  ind <- which(sapply(x, is.numeric))
  for(j in ind){
    set(x, i = which(is.na(x[[j]])), j = j, value = 0)
  }
}

I think the first method above is really ingenious, as it exploits the fact that NAs are typed (see the illustration below).
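
A quick illustration of what "NAs are typed" means here (plain base R behaviour):

typeof(NA)          # "logical" -- the bare NA
typeof(NA_real_)    # "double"  -- a typed NA
is.numeric(c(NA, 1:4))           # TRUE:  NA is promoted to integer
is.numeric(c(NA, LETTERS[1:4]))  # FALSE: NA is promoted to character

So inside fun1, is.numeric(x[[j]]) is FALSE for non-numeric columns, the & turns every element to FALSE, and which() returns an empty index, meaning set() touches nothing there.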

First of all, even though .SD is not available in the i argument, it is possible to pull the column by name with get(), so I thought I could sub-assign the data.table this way:

fun3 <- function(x){
  nms <- names(x)[sapply(x, is.numeric)]
  for(j in nms){
    x[is.na(get(j)), (j):=0]
  }
}

The generic case, of course, would be to rely on .SD and .SDcols to work only on the numeric columns:

fun4 <- function(x){
  nms <- names(x)[sapply(x, is.numeric)]
  x[, (nms):=lapply(.SD, function(i) replace(i, is.na(i), 0)), .SDcols=nms]  
}

But then I thought to myself: "Hey, who says we can't go all the way to base R for this sort of operation?" Here's a simple lapply() with a conditional statement, wrapped in setDT():

fun5 <- function(x){
  # note: this builds a new data.table from the list that lapply() returns,
  # rather than modifying x by reference
  setDT(
    lapply(x, function(i){
      if(is.numeric(i))
        i[is.na(i)] <- 0
      i
    })
  )
}

Finally, we could use the same conditional idea to limit the columns on which we apply set():

fun6 <- function(x){
  for(j in seq_along(x)){
    if(is.numeric(x[[j]]))
      set(x, i = which(is.na(x[[j]])), j = j, value = 0)
  }
}

Here are the benchmarks:

microbenchmark::microbenchmark(
  for.set.2cond = fun1(copy(DT)),
  for.set.ind   = fun2(copy(DT)),
  for.get       = fun3(copy(DT)),
  for.SDcol     = fun4(copy(DT)),
  for.list      = fun5(copy(DT)),
  for.set.if    = fun6(copy(DT))
)

#> Unit: microseconds
#>           expr     min      lq     mean   median       uq      max neval cld
#>  for.set.2cond  59.812  67.599 131.6392  75.5620 114.6690 4561.597   100 a  
#>    for.set.ind  71.492  79.985 142.2814  87.0640 130.0650 4410.476   100 a  
#>        for.get 553.522 569.979 732.6097 581.3045 789.9365 7157.202   100   c
#>      for.SDcol 376.919 391.784 527.5202 398.3310 629.9675 5935.491   100  b 
#>       for.list  69.722  81.932 137.2275  87.7720 123.6935 3906.149   100 a  
#>     for.set.if  52.380  58.397 116.1909  65.1215  72.5535 4570.445   100 a  
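
As a postscript not covered by these benchmarks: data.table versions from roughly 1.12.4 onward also ship setnafill(), which replaces NAs by reference and accepts only numeric columns, so on a recent installation the whole task reduces to the following (worth checking against your installed version):

ind <- which(sapply(DT, is.numeric))
setnafill(DT, type = "const", fill = 0, cols = ind)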
Warp answered 4/1, 2019 at 15:12

You can use the tidyverse purrr function map_if along with ifelse to do the job in a single line of code.

library(data.table)
library(tidyverse)
set.seed(24)
DT <- data.table(v1 = sample(c(1:3, NA), 20, replace = TRUE),
                 v2 = sample(c(LETTERS[1:3], NA), 20, replace = TRUE),
                 v3 = sample(c(1:3, NA), 20, replace = TRUE))

The single line of code below takes a DT with numeric and non-numeric columns and operates only on the numeric columns, replacing the NAs with 0:

DT %>% map_if(is.numeric, ~ ifelse(is.na(.x), 0, .x)) %>% as.data.table
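
One caveat worth adding (standard R semantics rather than anything specific to this answer): the pipeline builds a new object instead of modifying DT by reference, so assign the result back if you want DT itself updated:

DT <- DT %>% map_if(is.numeric, ~ ifelse(is.na(.x), 0, .x)) %>% as.data.table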

So, tidyverse can be less verbose than data.table sometimes :-)

Ibson answered 31/8, 2020 at 3:17
May I ask why my answer was downvoted? Did it not work? – Ibson
Thanks, helped me plenty! – Parry
