do.call rbind of data.table depends on location of NA
Consider this

do.call(rbind, list(data.table(x=1, b='x'),data.table(x=1, b=NA)))

returns

   x  b
1: 1  x
2: 1 NA

but

do.call(rbind, list(data.table(x=1, b=NA),data.table(x=1, b='x')))

returns

   x  b
1: 1 NA
2: 1 NA

How can I force the first behavior without reordering the contents of the list?

data.table is many times faster than data.frame in my MapReduce jobs (which call data.table ~10*3MM times across 55 nodes), so I want this to work. Regards, Saptarshi

Bedroom answered 27/8, 2013 at 20:45 Comment(3)
I'm guessing this happens because NA is logical and as.logical('x') is NA, so when rbind decides that the column is logical (based on its first argument), it coerces subsequent items to match. do.call(rbind, list(data.table(x=1, b=as(NA,'character')),data.table(x=1, b='x'))) works... - Mcghee
By the way, there is an "optimized do.call(rbind,...)" for data.tables called rbindlist. There are a few questions about it on this site, e.g., #15674050. - Mcghee
@Mcghee -- Very helpful comments. I've added a reference to rbindlist to my answer. - Hearst

As noted by Frank, the problem is that there are (somewhat invisibly) several different types of NA. The one produced when you type NA at the command line is of class "logical", but there are also NA_integer_, NA_real_, NA_character_, and NA_complex_.
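To make the distinction concrete, here is a small base-R sketch (it does not need data.table) that inspects the class of each typed NA constant. The variable names na_values, na_classes, and all_missing are only illustrative.

```r
# Each typed NA constant is a missing value of a different type
na_values <- list(NA, NA_integer_, NA_real_, NA_character_, NA_complex_)

# Collect the class of each one
na_classes <- vapply(na_values, class, character(1))
print(na_classes)
# [1] "logical"   "integer"   "numeric"   "character" "complex"

# All of them still test as missing
all_missing <- all(vapply(na_values, is.na, logical(1)))
print(all_missing)
# [1] TRUE
```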

In your first example, the initial data.table sets the class of column b to "character", and the NA in the second data.table is then coerced to an NA_character_. In the second example, though, the NA in the first data.table sets column b's class to "logical", and, when the same column in the second data.table is coerced to "logical", it's converted to a logical NA. (Try as.logical("x") to see why.)
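The two coercion directions described above can be checked in base R. This is only a sketch of the underlying conversions, not of what rbind() does internally; to_logical and to_character are illustrative names.

```r
# character -> logical: "x" is not a recognised logical literal, so it becomes NA
to_logical <- as.logical("x")
print(is.na(to_logical))    # TRUE
print(class(to_logical))    # "logical"

# logical NA -> character: the value stays missing, but is now NA_character_
to_character <- as.character(NA)
print(is.na(to_character))  # TRUE
print(class(to_character))  # "character"
```

So coercing towards "character" keeps every value, while coercing towards "logical" silently discards the string.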

That's all fairly complicated (to articulate, at least), but there is a reasonably simple solution. Just create a 1-row template data.table, and prepend it to each list of data.tables you want to rbind(). It will establish the class of each column to be what you want, regardless of which data.tables follow it in the list passed to rbind(), and it can be trimmed off once everything else is bound together.

library(data.table)    

## The two lists of data.tables from the OP
A <- list(data.table(x=1, b='x'),data.table(x=1, b=NA))
B <- list(data.table(x=1, b=NA),data.table(x=1, b='x'))

## A 1-row template, used to set the column types (and then removed)
DT <- data.table(x=numeric(1), b=character(1))

## Test it out
do.call(rbind, c(list(DT), A))[-1,]
#    x  b
# 1: 1  x
# 2: 1 NA
do.call(rbind, c(list(DT), B))[-1,]
#    x  b
# 1: 1 NA
# 2: 1  x

## Finally, as _also_ noted by Frank, rbindlist will likely be more efficient
rbindlist(c(list(DT), B))[-1,]
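
A possible variation, which is not part of the original answer: instead of prepending a template row, coerce column b to character in every table before binding, in the spirit of Frank's as(NA,'character') comment. This is a sketch that assumes the column name b is known in advance; fixed and out are illustrative names, and copy() is used so the tables in the original list are not modified by reference.

```r
library(data.table)

# The problematic ordering from the question: the logical NA comes first
B <- list(data.table(x = 1, b = NA), data.table(x = 1, b = "x"))

# Coerce column b to character in a copy of each table
fixed <- lapply(B, function(d) copy(d)[, b := as.character(b)])

# Every table now agrees on the column type, so order no longer matters
out <- rbindlist(fixed)
print(class(out$b))   # "character"
print(is.na(out$b))   # TRUE FALSE
print(out$b[2])       # "x"
```

This touches every table once, so on very long lists it may cost more than the single template row.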
Beetner answered 27/8, 2013 at 21:8 Comment(1)
Of course that would presumably slow the rbinding down somewhat in all cases. On the other hand, it might not be too hard to add a second 'colClasses' argument to rbindlist(), allowing users to pass in either a character vector of class names or a list with elements of the desired classes. - Hearst
