Fastest way to replace NAs in a large data.table
I have a large data.table with many missing values scattered throughout its ~200k rows and 200 columns. I would like to recode those NA values to zeros as efficiently as possible.

I see two options:
1: Convert to a data.frame, and use something like this
2: Some kind of cool data.table subsetting command

I'll be happy with a fairly efficient solution of type 1. Converting to a data.frame and then back to a data.table won't take too long.

Xylia answered 29/8, 2011 at 20:35 Comment(9)
Why do you want to convert the data.table to a data.frame? A data.table is a data.frame. Any data.frame operation will just work.Narcis
@Andrie: a key difference is that you can't access a column in a data.table by specifying the column number, so DT[,3] will not give the third column. I think this makes the solution proposed in the link unviable here. I am sure there is an elegant approach using some data.table wizardry!Phalansterian
@Ramnath: you're exactly right! At the moment, I'm converting to a data.frame and looping through the columns to replace the NAs, but I'm looking for some elegant data.table wizardry!Xylia
@Zach: what function exactly are you going to apply to the data.table? It might be fine to retain the NAs and deal with them later. If you provide more details, I think it would be easier to come up with a data.table solution.Phalansterian
@Ramnath: The situation is that the data table is the result of a large aggregation and merging operation. The values in most of the columns are counts, and the NAs are in fact zeros (no counts=no observations). I need to recode these NAs as zeros because I am going to be building a predictive model using this dataset.Xylia
@Ramnath, AFAIK, DT[, 3, with=FALSE] returns the third column.Narcis
@Andrie: good point. I completely forgot about the with argument.Phalansterian
@Andrie: but there is still a problem: mydf[is.na(mydf) == TRUE] does the job on data frames, while mydt[is.na(mydt) == TRUE] gives me something strange, even if I use with=FALSE.Phalansterian
@Ramnath, point taken. My earlier statement was too broad, i.e. I was wrong. Sorry. Data.tables only behave like data.frames when there isn't a data.table method.Narcis

Here's a solution using data.table's := operator, building on Andrie and Ramnath's answers.

require(data.table)  # v1.6.6
require(gdata)       # v2.8.2

set.seed(1)
dt1 = create_dt(2e5, 200, 0.1)
dim(dt1)
[1] 200000    200    # more columns than Ramnath's answer which had 5 not 200

f_andrie = function(dt) remove_na(dt)

f_gdata = function(dt, un = 0) gdata::NAToUnknown(dt, un)

f_dowle = function(dt) {     # see EDIT later for more elegant solution
  na.replace = function(v,value=0) { v[is.na(v)] = value; v }
  for (i in names(dt))
    eval(parse(text=paste("dt[,",i,":=na.replace(",i,")]")))
}

system.time(a_gdata <- f_gdata(dt1)) 
   user  system elapsed 
 18.805  12.301 134.985 

system.time(a_andrie <- f_andrie(dt1))
Error: cannot allocate vector of size 305.2 Mb
Timing stopped at: 14.541 7.764 68.285 

system.time(f_dowle(dt1))
  user  system elapsed 
 7.452   4.144  19.590     # EDIT has faster than this

identical(a_gdata, dt1)   
[1] TRUE

Note that f_dowle updated dt1 by reference. If a local copy is required, then an explicit call to the copy function is needed to make a local copy of the whole dataset. data.table's setkey, key<- and := do not copy-on-write.
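As a small illustration of that by-reference behaviour (a toy table, not the benchmark data; the names dt, orig, and dup below are made up for this sketch):

```r
library(data.table)

dt <- data.table(x = c(1, NA, 3))

# := updates dt by reference: no copy is made, even inside a function
f <- function(d) d[is.na(x), x := 0]
f(dt)
dt$x                      # the original dt was modified: 1 0 3

# to leave the original untouched, take an explicit copy first
orig <- data.table(x = c(1, NA, 3))
dup  <- copy(orig)
dup[is.na(x), x := 0]
orig$x                    # still 1 NA 3
```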

Next, let's see where f_dowle is spending its time.

Rprof()
f_dowle(dt1)
Rprof(NULL)
summaryRprof()
$by.self
                  self.time self.pct total.time total.pct
"na.replace"           5.10    49.71       6.62     64.52
"[.data.table"         2.48    24.17       9.86     96.10
"is.na"                1.52    14.81       1.52     14.81
"gc"                   0.22     2.14       0.22      2.14
"unique"               0.14     1.36       0.16      1.56
... snip ...

There, I would focus on na.replace and is.na, where there are a few vector copies and vector scans. Those can fairly easily be eliminated by writing a small na.replace C function that updates NA by reference in the vector. That would at least halve the 20 seconds I think. Does such a function exist in any R package?

The reason f_andrie fails may be that it copies the whole of dt1, or creates a logical matrix as big as the whole of dt1, a few times. The other two methods work on one column at a time (although I only briefly looked at NAToUnknown).

EDIT (more elegant solution as requested by Ramnath in comments) :

f_dowle2 = function(DT) {
  for (i in names(DT))
    DT[is.na(get(i)), (i):=0]
}

system.time(f_dowle2(dt1))
  user  system elapsed 
 6.468   0.760   7.250   # faster, too

identical(a_gdata, dt1)   
[1] TRUE

I wish I did it that way to start with!

EDIT2 (over 1 year later, now)

There is also set(). This can be faster if there are a lot of columns being looped through, as it avoids the (small) overhead of calling [,:=,] in a loop. set is a loopable :=. See ?set.

f_dowle3 = function(DT) {
  # either of the following for loops

  # by name:
  for (j in names(DT))
    set(DT, which(is.na(DT[[j]])), j, 0)

  # or by number (slightly faster than by name):
  for (j in seq_len(ncol(DT)))
    set(DT, which(is.na(DT[[j]])), j, 0)
}
Kianakiang answered 30/8, 2011 at 20:33 Comment(14)
+1, great answer! Is it possible to have a more intuitive equivalent of the eval(parse) stuff? On a broader note, I think it would be useful to have operations that work on all elements of the data.table.Phalansterian
@Phalansterian See edit: avoids eval(parse) and is much faster, too. Yes we could do ops that work on all elements. There is an open roadmap for data.table - feel free to add feature requests there and that way it won't get forgotten.Kianakiang
Your 2nd block of code seems to be the most data.table appropriate way to do this. Thanks!Xylia
hmmm, the current version of data.table seems to break your most recent function. Also, create_dt no longer exists!Xylia
@Xylia Hi. create_dt is in Andrie's answer. What's the error? When you say current version, you mean 1.8.2 on CRAN, or 1.8.3 on R-Forge?Kianakiang
@MatthewDowle: Hmmm, and this morning it's working fine. I'll have to go back and write a reproducible example, sorry!Xylia
@MattDowle it seems to me that f_dowle3 was broken again in data.table_1.9.4. I get a complaint from set(), Warning message: In set(DT, which(is.na(DT[[j]])), j, 0) : Coerced 'double' RHS to 'logical' to match the column's typeMultiparous
@Multiparous I guess your DT has columns of type logical, unlike the create_dt() example for this test. Change the 4th argument of the set() call (which is 0 in your example and type double in R) to FALSE and it should work without warning.Kianakiang
@Multiparous And I've filed a feature request to relax this case and drop that warning when coercing length-1 vectors 0 and 1 to logical: #996. Might not do it since, for speed, you do want to be warned about unnecessary repetitive coercing.Kianakiang
Thanks for the answer. It's REALLY fast. A 6.1M-row x 196-column data.table got treated in 10 seconds.Evaporite
Thanks for the solution and all the benchmarking. Tiny nitpick: you can replace seq_len(ncol(DT)) by seq_along(DT)Seraphim
@StefanF True, and I prefer seq_along(DT) too. But then the reader has to know that seq_along goes along the columns and not down the rows. seq_len(ncol(DT)) is a tiny bit more explicit for that reason.Kianakiang
For my data, the answer here seemed much quicker, maybe worth a try: #20536005Fabrienne
@MattDowle I have been trying to adapt your code to take into account column type, and replace negative values with NA's instead of replacing NA's with 0. I have however not been very successful. Would you perhaps be willing to take a look? #52382798Zorina

Here's the simplest one I could come up with:

dt[is.na(dt)] <- 0

It's efficient, and there's no need to write functions or other glue code.

Gyrostatic answered 12/10, 2016 at 13:9 Comment(5)
doesn't work on big datasets and normal workstation computers (memory allocation error)Zeculon
@Zeculon on a machine with 16GB of RAM I was able to run this on 31M rows, ~20 columns. YMMV of course.Gyrostatic
I defer to your empirical evidence. Thanks.Zeculon
Unfortunately, in the latest versions of data.table it doesn't work. It says Error in `[.data.table`(dt, is.na(dt)): i is invalid type (matrix). Perhaps in future a 2 column matrix could return a list of elements of DT (in the spirit of A[B] in FAQ 2.14). Please let datatable-help know if you'd like this, or add your comments to FR #657.Cascade
this is interesting! I always used setPratfall

Dedicated functions for exactly this purpose, nafill and setnafill, are available in the data.table package (version >= 1.12.4):

They process columns in parallel, which addresses the previously posted benchmarks well. Below are their timings versus the fastest approach so far, also scaled up, on a 40-core machine.

library(data.table)
create_dt <- function(nrow=5, ncol=5, propNA = 0.5){
  v <- runif(nrow * ncol)
  v[sample(seq_len(nrow*ncol), propNA * nrow*ncol)] <- NA
  data.table(matrix(v, ncol=ncol))
}
f_dowle3 = function(DT) {
  for (j in seq_len(ncol(DT)))
    set(DT, which(is.na(DT[[j]])), j, 0)
}

set.seed(1)
dt1 = create_dt(2e5, 200, 0.1)
dim(dt1)
#[1] 200000    200
dt2 = copy(dt1)
system.time(f_dowle3(dt1))
#   user  system elapsed 
#  0.193   0.062   0.254 
system.time(setnafill(dt2, fill=0))
#   user  system elapsed 
#  0.633   0.000   0.020   ## setDTthreads(1) elapsed: 0.149
all.equal(dt1, dt2)
#[1] TRUE

set.seed(1)
dt1 = create_dt(2e7, 200, 0.1)
dim(dt1)
#[1] 20000000    200
dt2 = copy(dt1)
system.time(f_dowle3(dt1))
#   user  system elapsed 
# 22.997  18.179  41.496
system.time(setnafill(dt2, fill=0))
#   user  system elapsed 
# 39.604  36.805   3.798 
all.equal(dt1, dt2)
#[1] TRUE
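As a small sketch of the difference between the two functions (on a toy table; the names dt and filled_a are made up for this example): nafill returns a filled copy of a vector, while setnafill fills the table's columns in place.

```r
library(data.table)  # >= 1.12.4 for nafill/setnafill

dt <- data.table(a = c(1, NA, 3), b = c(NA, 5, NA))

# nafill() works on a single vector (or a list of vectors) and
# returns a filled copy; dt itself is untouched
filled_a <- nafill(dt$a, fill = 0)
filled_a        # 1 0 3
dt$a            # still 1 NA 3

# setnafill() updates the (numeric) columns of dt by reference
setnafill(dt, fill = 0)
dt$b            # 0 5 0
```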
Evansville answered 3/2, 2019 at 15:32 Comment(2)
That's a great feature! Are you planning to add support for character columns? Then it could be used here.Sulamith
@Sulamith yes, you can track this feature here github.com/Rdatatable/data.table/issues/3992Evansville
library(data.table)

DT = data.table(a=c(1,"A",NA),b=c(4,NA,"B"))

DT
    a  b
1:  1  4
2:  A NA
3: NA  B

DT[,lapply(.SD,function(x){ifelse(is.na(x),0,x)})]
   a b
1: 1 4
2: A 0
3: 0 B

Just for reference: this is slower than the gdata or data.matrix approaches, but it uses only the data.table package and can deal with non-numerical entries.

Glairy answered 20/7, 2015 at 22:18 Comment(1)
You could probably both avoid ifelse and update by reference by doing DT[, names(DT) := lapply(.SD, function(x) {x[is.na(x)] <- "0" ; x})]. And I doubt it will be slower than the answers you've mentioned.Activism

Here is a solution using NAToUnknown in the gdata package. I have used Andrie's solution to create a huge data.table and also included time comparisons with it.

# CREATE DATA TABLE
dt1 = create_dt(2e5, 200, 0.1)

# FUNCTIONS TO SET NA TO ZERO   
f_gdata  = function(dt, un = 0) gdata::NAToUnknown(dt, un)
f_Andrie = function(dt) remove_na(dt)

# COMPARE SOLUTIONS AND TIMES
system.time(a_gdata  <- f_gdata(dt1))

user  system elapsed 
4.224   2.962   7.388 

system.time(a_andrie <- f_Andrie(dt1))

 user  system elapsed 
4.635   4.730  20.060 

identical(a_gdata, a_andrie)  

TRUE
Phalansterian answered 30/8, 2011 at 17:36 Comment(3)
+1 Good find. Interesting - it's the first time I see timings with similar user time but really big difference in elapsed time.Narcis
@Narcis I tried using rbenchmark to benchmark solutions using more replications, but got an out of memory error possibly due to the size of the data frame. if you can run benchmark on both these solutions with multiple replications, those results would be interesting as i am not really sure why i am getting a 3x speedupPhalansterian
@Phalansterian To get things correct, the timings in this answer are for ncol=5 I think (should take much longer) due to the bug in create_dt.Kianakiang

My understanding is that the secret to fast operations in R is to utilise vectors (or arrays, which are vectors under the hood).

In this solution I make use of a data.matrix, which is an array but behaves a bit like a data.frame. Because it is an array, you can use a very simple vector substitution to replace the NAs:

A little helper function to remove the NAs. The essence is a single line of code. I only do this to measure execution time.

remove_na <- function(x){
  dm <- data.matrix(x)
  dm[is.na(dm)] <- 0
  data.table(dm)
}

A little helper function to create a data.table of a given size.

create_dt <- function(nrow=5, ncol=5, propNA = 0.5){
  v <- runif(nrow * ncol)
  v[sample(seq_len(nrow*ncol), propNA * nrow*ncol)] <- NA
  data.table(matrix(v, ncol=ncol))
}

Demonstration on a tiny sample:

library(data.table)
set.seed(1)
dt <- create_dt(5, 5, 0.5)

dt
            V1        V2        V3        V4        V5
[1,]        NA 0.8983897        NA 0.4976992 0.9347052
[2,] 0.3721239 0.9446753        NA 0.7176185 0.2121425
[3,] 0.5728534        NA 0.6870228 0.9919061        NA
[4,]        NA        NA        NA        NA 0.1255551
[5,] 0.2016819        NA 0.7698414        NA        NA

remove_na(dt)
            V1        V2        V3        V4        V5
[1,] 0.0000000 0.8983897 0.0000000 0.4976992 0.9347052
[2,] 0.3721239 0.9446753 0.0000000 0.7176185 0.2121425
[3,] 0.5728534 0.0000000 0.6870228 0.9919061 0.0000000
[4,] 0.0000000 0.0000000 0.0000000 0.0000000 0.1255551
[5,] 0.2016819 0.0000000 0.7698414 0.0000000 0.0000000
Narcis answered 29/8, 2011 at 21:31 Comment(4)
That's a very nice example dataset. I'll try and improve on remove_na. That timing of 21.57s includes the create_dt (including runif and sample) together with the remove_na. Any chance you could edit to split out the 2 times?Kianakiang
Is there a small bug in create_dt? It seems to always create a 5 column data.table regardless of ncol passed in.Kianakiang
@MatthewDowle Well spotted. Error removed (as well as the timings)Narcis
Converting to matrix will only work properly if all columns are the same type.Cascade

For the sake of completeness, another way to replace NAs with 0 is to use

f_rep <- function(dt) {
  dt[is.na(dt)] <- 0
  return(dt)
}

To compare results and times I have incorporated all approaches mentioned so far.

set.seed(1)
dt1 <- create_dt(2e5, 200, 0.1)
dt2 <- copy(dt1)   # copy(), so the by-reference functions don't also modify dt1
dt3 <- copy(dt1)

system.time(res1 <- f_gdata(dt1))
   user  system elapsed 
   3.62    0.22    3.84 
system.time(res2 <- f_andrie(dt1))
   user  system elapsed 
   2.95    0.33    3.28 
system.time(f_dowle2(dt2))
   user  system elapsed 
   0.78    0.00    0.78 
system.time(f_dowle3(dt3))
   user  system elapsed 
   0.17    0.00    0.17 
system.time(res3 <- f_unknown(dt1))
   user  system elapsed 
   6.71    0.84    7.55 
system.time(res4 <- f_rep(dt1))
   user  system elapsed 
   0.32    0.00    0.32 

identical(res1, res2) & identical(res2, res3) & identical(res3, res4) & identical(res4, dt2) & identical(dt2, dt3)
[1] TRUE

So the new approach is slightly slower than f_dowle3 but faster than all the other approaches. To be honest, though, this goes against my intuition of the data.table syntax and I have no idea why it works. Can anybody enlighten me?

Lobscouse answered 15/10, 2015 at 8:16 Comment(2)
Yes I checked them, this is why I have included the pairwise identicals.Lobscouse
Here's a reason why it's not the idiomatic way - stackoverflow.com/a/20545629Vent

Using the fifelse function from the newest data.table version (1.12.6), this is even 10 times faster than NAToUnknown in the gdata package:

z = data.table(x = sample(c(NA_integer_, 1), 2e7, TRUE))
system.time(z[,x1 := gdata::NAToUnknown(x, 0)])

#   user  system elapsed 
#  0.798   0.323   1.173 
system.time(z[,x2:= fifelse(is.na(x), 0, x)])

#   user  system elapsed 
#  0.172   0.093   0.113 
Adkison answered 7/11, 2019 at 15:17 Comment(1)
Can you add some timing comparisons to this answer? I think f_dowle3 will still be faster: https://mcmap.net/q/54462/-fastest-way-to-replace-nas-in-a-large-data-tableXylia

To generalize to many columns you could use this approach (using previous sample data but adding a column):

z = data.table(x = sample(c(NA_integer_, 1), 2e7, TRUE), y = sample(c(NA_integer_, 1), 2e7, TRUE))

z[, names(z) := lapply(.SD, function(x) fifelse(is.na(x), 0, x))]

I didn't test for speed, though.

Fanny answered 20/2, 2020 at 13:59 Comment(0)
> DT = data.table(a=LETTERS[c(1,1:3,4:7)], b=sample(c(15,51,NA,12,21), 8, TRUE), key="a")
> DT
   a  b
1: A 12
2: A NA
3: B 15
4: C NA
5: D 51
6: E NA
7: F 15
8: G 51
> DT[is.na(b),b:=0]
> DT
   a  b
1: A 12
2: A  0
3: B 15
4: C  0
5: D 51
6: E  0
7: F 15
8: G 51
> 
Ladybird answered 20/3, 2016 at 13:12 Comment(2)
And how would you generalize this to more than one columns though?Activism
@DavidArenburg just write a for loop. This should be the accepted answer: it's the simplest!Flameproof
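Following up on that comment, a minimal sketch of the for-loop generalization (essentially f_dowle2 from the accepted answer, shown here on a toy two-column table):

```r
library(data.table)

DT <- data.table(a = c(1, NA, 3), b = c(NA, 5, NA))

# loop over column names, updating each column by reference
for (col in names(DT)) DT[is.na(get(col)), (col) := 0]

DT
#    a b
# 1: 1 0
# 2: 0 5
# 3: 3 0
```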

A fast alternative is collapse::replace_NA, which by default replaces NAs with 0.

library(collapse)
replace_NA(df)

Microbenchmark on a data.frame with 10 columns of 1M rows and 10% NAs.

# Unit: milliseconds
#        expr     min       lq     mean   median       uq      max neval
#    f_dowle3 30.9308 34.44890 50.46713 47.33065 58.04780 160.4836   100
#   setnafill 10.3389 10.92065 12.34867 11.79305 13.41090  22.0207   100
#  replace_NA  9.9896 10.98030 15.19177 12.87030 17.00505  83.5094   100

Code:

library(data.table)
library(collapse)
library(microbenchmark)

set.seed(1)
df <- as.data.frame(replicate(10, runif(1e6)))
df <- na_insert(df, prop = 0.1)
dt <- df
setDT(dt)

f_dowle3 = function() {
  for (j in seq_len(ncol(dt)))
    set(dt, which(is.na(dt[[j]])), j, 0)
}

mb <- 
  microbenchmark(
  f_dowle3 = f_dowle3(),
  setnafill = data.table::setnafill(dt, fill = 0),
  replace_NA = collapse::replace_NA(dt, set = TRUE)
)
Haddad answered 3/8, 2023 at 9:49 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.