Fastest way to read in 100,000 .dat.gz files

I have a few hundred thousand very small .dat.gz files that I want to read into R in the most efficient way possible. I read in the file and then immediately aggregate and discard the data, so I am not worried about managing memory as I get near the end of the process. I just really want to speed up the bottleneck, which happens to be unzipping and reading in the data.

Each dataset consists of 366 rows and 17 value columns (plus a Day column). Here is a reproducible example of what I am doing so far:

Building reproducible data:

require(data.table)

# Make dir
system("mkdir practice")

# Function to create data
create_write_data <- function(file.nm) {
  dt <- data.table(Day=0:365)
  dt[, (paste0("V", 1:17)) := lapply(1:17, function(x) rnorm(n=366))]
  write.table(dt, paste0("./practice/",file.nm), row.names=FALSE, sep="\t", quote=FALSE)
  system(paste0("gzip ./practice/", file.nm))    
}

And here is the code that applies it:

# Apply function to create 10 fake zipped data.frames (550 kb on disk)
tmp <- lapply(paste0("dt", 1:10,".dat"), function(x) create_write_data(x))

And here is my most efficient code so far to read in the data:

# Function to read in files as fast as possible
read_Fast <- function(path.gz) {
  system(paste0("gzip -d ", path.gz))  # Unzip file with an external gzip call
  path.dat <- gsub(".gz", "", path.gz) # Path of the decompressed .dat file
  dat_run <- fread(path.dat)           # Read it in; the data.table is returned
}

# Apply above function
dat.files <- list.files(path="./practice", full.names = TRUE)
system.time(dat.list <- rbindlist(lapply(dat.files, read_Fast), fill=TRUE))
dat.list

I have bundled this up in a function and applied it in parallel, but it is still far too slow for what I need.
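
For reference, the parallel version is along these lines (a sketch only; parallel::mclapply and the core count are illustrative rather than my exact code):

require(parallel)

# Read every file in parallel with read_Fast, then stack the results
dat.files <- list.files(path="./practice", full.names=TRUE)
dat.list  <- rbindlist(mclapply(dat.files, read_Fast, mc.cores=4), fill=TRUE)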

I have already tried h2o.importFolder from the wonderful h2o package, but it is actually much slower than plain R with data.table. Maybe there is a way to speed up the unzipping of the files, but I am not sure. From the few times I have run this, I have noticed that unzipping the files usually takes about two-thirds of the function's run time.
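
For reference, the h2o attempt was along these lines (a sketch; the exact h2o.importFolder arguments here are illustrative):

require(h2o)
h2o.init(nthreads=-1)  # start a local H2O cluster

# Import the whole folder of .dat.gz files as one H2OFrame
# (H2O handles the gzip decompression itself)
dat.h2o <- h2o.importFolder(path="./practice", pattern="\\.dat\\.gz$")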

Megaphone asked 3/3, 2016 at 5:00 Comment(1)
I'm getting improved speeds (in comparison to your most efficient code so far) by using read_tsv from the "readr" package: rbindlist(lapply(dat.files, read_tsv)) – Kirkland

I'm sort of surprised that this actually worked. Hopefully it works for your case. I'm quite curious how the speed compares to reading the compressed data from disk directly in R instead (albeit with a penalty for non-vectorization).

# Read the header line of the first file to recover the column names
tblNames = fread('cat *dat.gz | gunzip | head -n 1')[, colnames(.SD)]
# Decompress and concatenate every file in a single shell pipeline,
# dropping each file's header row with grep
tbl = fread('cat *dat.gz | gunzip | grep -v "^Day"')
setnames(tbl, tblNames)
tbl
Shun answered 4/3, 2016 at 0:55 Comment(4)
Surprised too. This is amazing. Any idea how well it compares in terms of speed to the other methods? – Psoas
I just edited the answer. Wondering as well, as the OP has an excellent test environment... – Shun
Great answer! Using this method, I was able to read and aggregate my data much much faster. Using 8 cores, I was able to read in and process 696,000 files in 1.5 minutes, where before it took 12 minutes. I will next need to scale this to millions of files, so this is a huge help! Can I ask what the grep -v "^Day" part of the code is doing? – Megaphone
@Mike it's removing line 1 (the headers) in each of the files. It's a bit of a kludge, but I haven't found a cleaner way to do it, and it's probably fine in general if you have a numeric first column. The '^' is an anchor, so it's saying "remove all lines where the first three characters of the line are Day". – Shun
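
For anyone wanting to combine this answer with the multi-core approach the OP describes above, one possible sketch (not code from the thread; the file chunking and core count are illustrative):

require(data.table)
require(parallel)

# Split the file list into one chunk per core and run the same
# cat | gunzip | grep pipeline on each chunk, then stack the chunks
files  <- list.files("./practice", pattern="\\.dat\\.gz$", full.names=TRUE)
chunks <- split(files, cut(seq_along(files), 8, labels=FALSE))

read_chunk <- function(fs) {
  # Note: a very long file list can exceed the shell's argument-length limit
  fread(paste("cat", paste(fs, collapse=" "), "| gunzip | grep -v '^Day'"))
}

tbl <- rbindlist(mclapply(chunks, read_chunk, mc.cores=8))
setnames(tbl, fread(paste("cat", files[1], "| gunzip | head -n 1"))[, colnames(.SD)])
tbl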

R has the ability to read gzipped files natively, using the gzfile function. See if this works.

# gzfile() gives a decompressing connection, so read.delim reads each
# .dat.gz without any external gzip call
rbindlist(lapply(dat.files, function(f) {
    read.delim(gzfile(f))
}))
Faulkner answered 3/3, 2016 at 6:28 Comment(2)
You can simplify this to rbindlist(lapply(dat.files, read.delim)), by the way. +1. This seems faster than read_tsv too. – Kirkland
This did help quite a bit. I am now able to read in 232,000 files in 12 minutes instead of 18. I need this to be quite a bit faster still, but this is a great start. – Megaphone

The bottleneck might be caused by the use of the system() call to an external application.

You should try using the built-in functions to extract the archive. This answer explains how: Decompress gz file using R
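
For example, here is a sketch of the question's read_Fast with the system() call replaced by an in-R decompression. It uses the R.utils package (so not strictly built-in, but it avoids shelling out to gzip):

require(R.utils)
require(data.table)

read_fast_noshell <- function(path.gz) {
  path.dat <- sub("\\.gz$", "", path.gz)
  # Decompress inside R rather than calling the external gzip binary
  R.utils::gunzip(path.gz, destname=path.dat, remove=TRUE, overwrite=TRUE)
  fread(path.dat)
}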

Selfstarter answered 3/3, 2016 at 5:39 Comment(0)
