[This is kind of multiple bug-reports/feature requests in one post, but they don't necessarily make sense in isolation. Apologies for the monster post in advance. Posting here as suggested by help(data.table). Also, I'm new to R; so apologies if I'm not following best practices in my code below. I'm trying.]
1. rbindlist crash on 6 * 8GB files (I have 128GB RAM)
First, I want to report that using rbindlist to append large data.tables causes R to segfault (Ubuntu 13.10, the packaged R version 3.0.1-3ubuntu1, data.table installed from CRAN from within R). The machine has 128 GiB of RAM, so I shouldn't be running out of memory given the size of the data.
My code:
append.tables <- function(files) {
    moves.by.year <- lapply(files, fread)
    move <- rbindlist(moves.by.year)
    rm(moves.by.year)
    move[,week_end := as.Date(as.character(week_end), format="%Y%m%d")]
    return(move)
}
Crash message:
append.tables crashes with this:
> system.time(move <- append.tables(files))
*** caught segfault ***
address 0x7f8e88dc1d10, cause 'memory not mapped'
Traceback:
1: rbindlist(moves.by.year)
2: append.tables(files)
3: system.time(move <- append.tables(files))
There are 6 files, each about 8 GiB or 100 million lines long with 8 variables, tab separated.
2. Could fread accept multiple file names?
In any case, I think a better approach here would be to allow fread to take files as a vector of file names:
files <- c("my", "files", "to be", "appended")
dt <- fread(files)
Presumably fread could be much more memory efficient under the hood, because it would not have to keep all of the intermediate objects around at the same time, as appears to be necessary for a user of R.
3. colClasses gives an error message
My second problem is that I need to specify a custom coercion handler for one of my data types, but that fails:
dt <- fread(tfile, colClasses=list(date="myDate"))
Error in fread(tfile, colClasses = list(date = "myDate")) :
Column name 'myDate' in colClasses not found in data
Yes, in the case of dates, a simple:
dt[,date := as.Date(as.character(date), format="%Y%m%d")]
works.
However, I have a different use case, which is to strip the decimal point from one of the data columns before it is converted from a character. Precision here is extremely important (thus our need for using the integer type), and coercing to an integer from the double type results in lost precision.
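To illustrate the precision concern, here is a minimal sketch in base R. The helper name strip_decimal is hypothetical, and it assumes every value carries the same number of decimal places (two here), since deleting the point only preserves the scale in that case.

```r
# Hypothetical helper: drop the decimal point from a price string and
# parse the remaining digits as an integer (price in hundredths).
# Assumes every input has exactly two decimal places.
strip_decimal <- function(x) {
  as.integer(gsub(".", "", x, fixed = TRUE))
}

price_chr <- "4.35"

# Going through a double first: 4.35 has no exact binary representation,
# so the product falls just below 435 and as.integer() truncates it.
via_double <- as.integer(as.numeric(price_chr) * 100)

# Going straight from the text to an integer keeps every digit.
via_string <- strip_decimal(price_chr)

print(c(via_double = via_double, via_string = via_string))
```

Rounding before the conversion would hide the problem in this small case, but converting from the text avoids relying on the double representation at all.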
Now, I can get around this with some system() calls that append the files and pipe them through some sed magic (simplified here; tfile is another temporary file):
if (has_header) {
    tfile2 <- tempfile()
    system(paste("echo fakeline >>", tfile2))
    system(paste("head -q -n1", files[[1]], ">>", tfile2))
    system(paste("tail -q -n+2", tfile2, paste(files, collapse=" "),
                 " | sed 's/\\.//' >>", tfile), wait=wait)
    unlink(tfile2)
} else {
    system(paste("cat", paste(files, collapse=" "), ">>", tfile), wait=wait)
}
but this involves an extra read/write cycle. I have 4 TiB of data to process, which is a LOT of extra reading and writing (no, not all into one data.table. About 1000 of them.)
4. fread thinks named pipes are empty files
I typically leave wait=TRUE. But I was trying to see if I could avoid the extra read/write cycle by making tfile a named pipe with system(paste("mkfifo", tfile)), setting wait=FALSE, and then running fread(tfile). However, fread complains about the pipe being an empty file:
system(paste("tail -q -n+2", tfile2, paste(files, collapse=" "),
" | sed 's/\\.//' >>", tfile), wait=FALSE)
move <- fread(tfile)
Error in fread(tfile) : File is empty: /tmp/RtmpbxNI1L/file78a678dc1999
In any case, this is a bit of a hack.
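One possible way around the named pipe, sketched under assumptions: recent data.table releases let fread run a shell command itself through its cmd argument, so the concatenation and the sed step need no hand-made temporary file. The helper names build_cmd and read_stripped are hypothetical, and the files are assumed to have no header row. As far as I know, fread may still spool the command's output to a temporary location internally, so this may not remove the extra write entirely.

```r
# Hypothetical helper: build the shell pipeline used above for
# headerless files (cat all files, strip the first '.' on each line).
build_cmd <- function(files) {
  paste("cat", paste(shQuote(files), collapse = " "), "| sed 's/\\.//'")
}

# Hypothetical helper: let fread run the pipeline directly.
# Assumes a data.table version whose fread() has the `cmd` argument.
read_stripped <- function(files, ...) {
  data.table::fread(cmd = build_cmd(files), ...)
}

# Usage (not run here):
#   move <- read_stripped(files, sep = "\t", header = FALSE)
```

Quoting each file name with shQuote keeps paths containing spaces from being split by the shell.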
Simplified Code if I had my wish list
Ideally, I would be able to do something like this:
setClass("Int_Price")
setAs("character", "Int_Price",
    function (from) {
        return(as.integer(gsub("\\.", "", from)))
    }
)
dt <- fread(files, colClasses=list(price="Int_Price"))
And then I'd have a nice long data.table with properly coerced data.
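Until something like this is supported, a workaround that avoids the double type is to read the column as text and convert it by reference afterwards. This is only a sketch: read_int_price is a hypothetical name, price is an assumed column name, and every value is assumed to carry the same number of decimal places.

```r
# Hypothetical helper: read one file, keeping `price_col` as character so
# it never passes through a double, then convert it in place.
read_int_price <- function(file, price_col = "price", ...) {
  dt <- data.table::fread(
    file,
    colClasses = stats::setNames("character", price_col),
    ...
  )
  # Strip the decimal point and parse the remaining digits as integer.
  data.table::set(
    dt, j = price_col,
    value = as.integer(gsub(".", "", dt[[price_col]], fixed = TRUE))
  )
  dt
}

# Usage (not run here):
#   dt <- read_int_price(tfile)
```

This costs an extra pass over the column in memory, but it needs no additional read/write cycle on disk.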