append multiple large data.table's; custom data coercion using colClasses and fread; named pipes
[This is kind of multiple bug reports/feature requests in one post, but they don't necessarily make sense in isolation. Apologies for the monster post in advance. Posting here as suggested by help(data.table). Also, I'm new to R, so apologies if I'm not following best practices in my code below. I'm trying.]

1. rbindlist crash on 6 * 8GB files (I have 128GB RAM)

First I want to report that using rbindlist to append large data.tables causes R to segfault (Ubuntu 13.10, the packaged R version 3.0.1-3ubuntu1, data.table installed from within R from CRAN). The machine has 128 GiB of RAM, so I shouldn't be running out of memory given the size of the data.

My code:

append.tables <- function(files) {
    # Read each file into its own data.table, then stack them
    moves.by.year <- lapply(files, fread)
    move <- rbindlist(moves.by.year)
    rm(moves.by.year)
    # Parse the yyyymmdd column into a Date
    move[, week_end := as.Date(as.character(week_end), format="%Y%m%d")]
    return(move)
}

append.tables crashes with this message:
> system.time(move <- append.tables(files))
 *** caught segfault ***
address 0x7f8e88dc1d10, cause 'memory not mapped'

Traceback:
 1: rbindlist(moves.by.year)
 2: append.tables(files)
 3: system.time(move <- append.tables(files))

There are 6 files, each about 8 GiB or 100 million lines long with 8 variables, tab separated.

2. Could fread accept multiple file names?

In any case, I think a better approach here would be to allow fread to take files as a vector of file names:

files <- c("my", "files", "to be", "appended")
dt <- fread(files)

Presumably fread could be much more memory efficient under the hood, since it wouldn't have to keep all of these intermediate objects alive at the same time, as appears to be necessary for a user of R.
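Until something like that exists, the idea can at least be wrapped up at user level (a sketch only; fread_many is a hypothetical name, not a data.table function):

```r
library(data.table)

## Sketch of the desired interface as a user-level wrapper:
## read each file, then stack the pieces with rbindlist.
fread_many <- function(paths, ...) {
    rbindlist(lapply(paths, fread, ...))
}

## Tiny demo with two temporary tab-separated files:
f1 <- tempfile(); f2 <- tempfile()
writeLines(c("id\tweek_end", "1\t20140101"), f1)
writeLines(c("id\tweek_end", "2\t20140108"), f2)
dt <- fread_many(c(f1, f2))
```

This still pays the intermediate-copies cost, of course; only fread itself could do the single up-front allocation.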

3. colClasses gives an error message

My second problem is that I need to specify a custom coercion handler for one of my data types, but that fails:

dt <- fread(tfile, colClasses=list(date="myDate"))
Error in fread(tfile, colClasses = list(date = "myDate")) : 
  Column name 'myDate' in colClasses not found in data

Yes, in the case of dates, a simple:

    dt[,date := as.Date(as.character(date), format="%Y%m%d")]

works.

However, I have a different use case, which is to strip the decimal point from one of the data columns before it is converted from a character. Precision here is extremely important (thus our need for using the integer type), and coercing to an integer from the double type results in lost precision.
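What I'm after could be done in base R after reading the column as character (a sketch, assuming every value carries the same number of decimal places, e.g. prices stored in hundredths):

```r
## Strip the decimal point from a character column, then convert to
## integer -- avoiding the round-trip through double that loses precision.
## Assumes a fixed number of decimal places across all values.
strip_decimal <- function(x) {
    as.integer(gsub(".", "", x, fixed = TRUE))
}

strip_decimal(c("100.25", "0.01", "3.50"))
```

The point of the feature request is to have fread apply such a handler per column during the read, rather than as a second pass.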

Now, I can get around this with some system() calls that append the files and pipe them through some sed magic (simplified here; tfile is another temporary file):

if (has_header) {
    tfile2 <- tempfile()
    # Seed tfile2 with a dummy line so the `tail -n+2` below skips it
    # and emits only the real header
    system(paste("echo fakeline >>", tfile2))
    # Grab the header from the first file
    system(paste("head -q -n1", files[[1]], ">>", tfile2))
    # Emit the single header, then every file minus its own header,
    # stripping the decimal point along the way
    system(paste("tail -q -n+2", tfile2, paste(files, collapse=" "),
                 " | sed 's/\\.//' >>", tfile), wait=wait)
    unlink(tfile2)
} else {
    system(paste("cat", paste(files, collapse=" "), ">>", tfile), wait=wait)
}

but this involves an extra read/write cycle. I have 4 TiB of data to process, which is a LOT of extra reading and writing (no, not all into one data.table. About 1000 of them.)

4. fread thinks named pipes are empty files

I typically leave wait=TRUE. But I was trying to see if I could avoid the extra read/write cycle by making tfile a named pipe (system(paste("mkfifo", tfile))), setting wait=FALSE, and then running fread(tfile). However, fread complains about the pipe being an empty file:

system(paste("tail -q -n+2", tfile2, paste(files, collapse=" "),
             " | sed 's/\\.//' >>", tfile), wait=FALSE)
move <- fread(tfile)
Error in fread(tfile) : File is empty: /tmp/RtmpbxNI1L/file78a678dc1999

In any case, this is a bit of a hack.
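[Editorial note: versions of data.table released well after this question added a cmd argument to fread, which streams a shell pipeline's output directly and removes the need for the fifo or the temporary file. A forward-looking sketch under that assumption:]

```r
library(data.table)

## Alternative to the named-pipe hack in newer data.table versions:
## let fread run the preprocessing command itself via `cmd`.
f <- tempfile()
writeLines(c("id\tprice", "1\t100.25", "2\t3.50"), f)

## sed strips the first '.' on each line, so "100.25" arrives as "10025"
dt <- fread(cmd = paste("sed 's/\\.//'", shQuote(f)))
```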

Simplified Code if I had my wish list

Ideally, I would be able to do something like this:

setClass("Int_Price")
setAs("character", "Int_Price",
    function (from) {
        return(as.integer(gsub("\\.", "", from)))
    }
)

dt <- fread(files, colClasses=list(price="Int_Price"))

And then I'd have a nice long data.table with properly coerced data.
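The setAs() half of this wish already works in plain R via methods::as(); only the fread hook is missing. A minimal check (base R only):

```r
library(methods)

## The coercion from the wish list, exercised directly with as():
setClass("Int_Price")
setAs("character", "Int_Price",
    function (from) {
        as.integer(gsub(".", "", from, fixed = TRUE))
    }
)

as("123.45", "Int_Price")
```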

Scalping answered 19/1, 2014 at 17:51 Comment(2)
Great! Thanks for taking your time to write down these points. It'd be even more helpful if you could please file them on the data.table project page. Scroll down to get the links for bugs and feature requests. On bugs, it'd be difficult to do anything unless we have a reproducible example. This many questions is very unlikely to get answered (and may even be closed), as it's against SO policy.Flophouse
You should file these as individual feature requests (FRs) / bugs, even though they appear collective to you.Flophouse
Update: The rbindlist bug has been fixed in commit 1100 v1.8.11. From NEWS:

o Fixed a rare segfault that occurred on >250m rows (integer overflow during memory allocation); closes #5305. Thanks to Guenter J. Hitsch for reporting.


As mentioned in the comments, you're supposed to ask separate questions separately. But since they're such good points and linked together into the wish at the end, ok, I'll answer in one go.

1. rbindlist crash on 6 * 8GB files (I have 128GB RAM)

Please run again changing the line :

moves.by.year <- lapply(files, fread)

to

moves.by.year <- lapply(files, fread, verbose=TRUE)

and send me the output. I don't think it is the size of the files, but something about the type and contents. You're right that fread and rbindlist should have no issue loading the 48GB of data on your 128GB box. As you say, the lapply should return 48GB and then the rbindlist should create a new 48GB single table. This should work on your 128GB machine since 96GB < 128GB. 100 million rows * 6 is 600 million rows, which is well under the 2 billion row limit so should be fine (data.table hasn't caught up with the long vector support introduced in R 3.0.0 yet, otherwise > 2^31 rows would be fine, too).

2. Could fread accept multiple file names?

Excellent idea. As you say, fread could then sweep through all 6 files detecting their types and counting the total number of rows, first. Then allocate once for the 600 million rows directly. This would save churning through 48GB of RAM needlessly. It might also detect any anomalies in the 5th or 6th file (say) before starting to read the first files, so would return quicker in the event of problems.
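[Editorial note: the first pass of that two-pass idea can be approximated from user code today by pre-counting rows cheaply; a rough sketch, relying on wc being available on the system:]

```r
## Rough sketch of the two-pass approach: pre-count total rows with
## `wc -l` so downstream code can size allocations once, up front.
count_rows <- function(paths) {
    sum(vapply(paths, function(p) {
        as.integer(system(paste("wc -l <", shQuote(p)), intern = TRUE))
    }, integer(1)))
}

f1 <- tempfile(); f2 <- tempfile()
writeLines(c("a", "b", "c"), f1)
writeLines(c("d", "e"), f2)
count_rows(c(f1, f2))
```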

I'll file this as a feature request and post the link here.

3. colClasses gives an error message

When colClasses is a list, the type appears to the left of the = and a vector of column names or positions appears to the right. The idea is to be easier than colClasses in read.csv, which only accepts a vector, and to save repeating "character" over and over. I could have sworn this was better documented in ?fread, but it seems not. I'll take a look at that.

So, instead of

fread(tfile, colClasses=list(date="myDate"))
Error in fread(tfile, colClasses = list(date = "myDate")) : 
    Column name 'myDate' in colClasses not found in data

the correct syntax is

fread(tfile, colClasses=list(myDate="date"))

Given what you go on to say in the question, if I understand correctly, you actually want :

fread(tfile, colClasses=list(character="date"))  # just fread accepts list

or

fread(tfile, colClasses=c("date"="character"))   # both read.csv and fread

Either of those should load the column called "date" as character so you can manipulate it before coercion. If it really is just dates, then I've still to implement that coercion automatically. You mentioned precision of numeric so just to remind that integer64 can be read directly by fread too.
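So the interim pattern is: read the column as character, then coerce in place. A sketch (set() modifies the column without copying the whole table; gsub()/as.integer stand in for any custom handler):

```r
library(data.table)

## Read-as-character then coerce in place. A small stand-in table is
## used here instead of fread output.
dt <- data.table(date  = c("20140101", "20140102"),
                 price = c("100.25", "3.50"))

for (col in "price") {
    set(dt, j = col,
        value = as.integer(gsub(".", "", dt[[col]], fixed = TRUE)))
}
dt[, date := as.Date(date, format = "%Y%m%d")]
```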

4. fread thinks named pipes are empty files

Hopefully this goes away now assuming the previous point is resolved? fread works by memory mapping its input. It can accept non-files such as http addresses and connections (tbc) and what it does first for convenience is to write the complete input to ramdisk so it can map the input from there. The reason fread is fast is hand in hand with seeing the entire input first.

Galumph answered 20/1, 2014 at 19:53 Comment(6)
rbindlist bug filed; fread request updated.Scalping
Thanks for your help. I should note that read.table/csv do in fact accept a list for colClasses. E.g. passing colClasses=list(integer_var="character") works just fine. I have to say fread's deviation from this behavior is surprising.Scalping
fread doesn't handle custom classes in colClasses properlyScalping
@Scalping Thanks, will take a look. colClasses=list(...) isn't documented in ?read.csv, though. My reading is that colClasses is supposed to be a character vector. If you're passing in a list and it's working, then that's lucky and not guaranteed to work in future (I assume it's converting the list to a character vector currently). Do you see the advantage of fread(,colClasses=list(character=150:200)) that data.table is trying to provide? Any other way to do that?Galumph
I do see the advantage for wide data of only having to specify each type once. That said, I try to avoid referring to column numbers except in fixed-width data, due to bugs that can be introduced by new column orderings. I was just surprised by the difference and thought it a bug, but you are right: now that I look closer at read.table, it doesn't exactly document the behavior I've seen. I'm more concerned that I can't specify custom handlers for columns (see my previous comment). This is really constraining my ability to read in large datasets right now.Scalping
@Scalping Any custom colClasses are being read as character currently, right? Just for (i in thosecolumns) DT[,(i):=as.myclass(get(i))] afterwards, or similar using set() if you prefer. Just until I can get that done automatically for you so you don't have this hassle. If that's ok?Galumph
