Fast reading and combining several files using data.table (with fread)

I have several different txt files that all have the same structure. I want to read them into R using fread and then combine (union) them into one larger dataset.

## First put all file names into a list 
library(data.table)
all.files <- list.files(path = "C:/Users",pattern = ".txt")

## Read data using fread
readdata <- function(fn){
    dt_temp <- fread(fn, sep=",")
    keycols <- c("ID", "date")
    setkeyv(dt_temp,keycols)  # Notice there's a "v" after setkey with multiple keys
    return(dt_temp)
}
# then using 
mylist <- lapply(all.files, readdata)
mydata <- do.call('rbind',mylist)

The code works fine, but the speed is not satisfactory. Each txt file has 1M observations and 12 fields.

If I use fread to read a single file, it is fast. But when I read all the files with lapply, the process is extremely slow and clearly takes much longer than reading the files one by one. I wonder what went wrong here, and whether there is any way to improve the speed.

I tried llply from the plyr package, but there was not much speed gain.

Also, is there any syntax in data.table to achieve a vertical join, like rbind in R or UNION in SQL?

Thanks.

Clothespin answered 16/1, 2014 at 8:0 Comment(0)

Use rbindlist(), which is designed to rbind a list of data.tables together...

mylist <- lapply(all.files, readdata)
mydata <- rbindlist( mylist )
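
If your files could differ in column order or in which columns are present, rbindlist() also takes use.names and fill arguments; a minimal sketch, assuming such mismatches actually exist in your files:

# Match columns by name rather than by position, filling missing columns with NA
mydata <- rbindlist(mylist, use.names = TRUE, fill = TRUE)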

And as @Roland says, do not set the key in each iteration of your function!

So in summary, this is best:

l <- lapply(all.files, fread, sep=",")
dt <- rbindlist( l )
setkey( dt , ID, date )
Housebound answered 16/1, 2014 at 8:16 Comment(5)
Also, set the key only once at the end.Tweedy
@SimonO'Hanlon, thanks a lot. Is the for loop faster than lapply?Clothespin
@Clothespin indeterminate. But if you think about it, what do you expect 99.999% of your processing time to be? The computational overhead of either for or lapply or reading in 1e6 observations of data? It's totally arbitrary in this case. I think that memory management might be better using the for loop, and it certainly isn't worse than lapply. There will be no speed difference between the two.Gilboa
If you are calling files outside of your working directory, be sure to add full.names = TRUE to list.files() e.g. list.files(path = "C:/Users",pattern = ".txt",full.names=TRUE). This will attach the full file path to each called file, allowing the lapply function to successfully locate and operate on each file.Susette
This is often still too slow. Here are some 25-50x faster ways https://mcmap.net/q/98510/-how-to-import-multiple-csv-files-at-onceMimetic
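
(Aside on the for vs lapply question raised in the comments above: a for-loop version of the same read-and-combine step is sketched below. This is an illustration only, not code from the thread; as noted, virtually all of the time goes into fread() itself.)

# Pre-allocate a list, fill it in a for loop, then combine once at the end
l <- vector("list", length(all.files))
for (i in seq_along(all.files)) {
  l[[i]] <- fread(all.files[i], sep = ",")
}
dt <- rbindlist(l)
setkey(dt, ID, date)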

I've re-written the code to do this way too many times, so I finally rolled it into a handy function, below.

data.table_fread_mult <- function(filepaths = NULL, dir = NULL, recursive = FALSE, pattern = NULL, fileCol = FALSE, ...){
  # fread multiple filepaths and then combine the results into a single data.table
  # This function has two interfaces: either
  # 1) provide `filepaths` as a character vector of filepaths to read or 
  # 2) provide `dir` (and optionally `pattern` and `recursive`) to identify the directory to read from
  # If fileCol = TRUE, result will include a column called File with the full source file path of each record
  # ... should be arguments to pass on to fread()
  # `pattern` is an optional regular expression to match files (e.g. pattern='csv$' matches files ending with 'csv')
  
  if(!is.null(filepaths) & (!is.null(dir) | !is.null(pattern))){
    stop("If `filepaths` is given, `dir` and `pattern` should be NULL")
  } else if(is.null(filepaths) & is.null(dir)){
    stop("If `filepaths` is not given, `dir` should be given")
  }
  
  # If filepaths isn't given, build it from dir, recursive, pattern
  if(is.null(filepaths)){
    filepaths <- list.files(
      path = dir, 
      full.names = TRUE, 
      recursive = recursive, 
      pattern = pattern
    )
  }
  
  # Read and combine files
  if(fileCol){
    return(rbindlist(lapply(filepaths, function(x) fread(x, ...)[, File := x]), use.names = TRUE))
  } else{
    return(rbindlist(lapply(filepaths, fread, ...), use.names = TRUE))
  }
}
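
A usage sketch (the directory path, file pattern, and extra fread() arguments here are placeholders, not part of the original answer):

library(data.table)

# Read every .csv under a directory tree and tag each row with its source file
dt <- data.table_fread_mult(
  dir = "C:/Users/data",
  pattern = "csv$",
  recursive = TRUE,
  fileCol = TRUE,
  sep = ","          # passed through to fread()
)

# Or pass an explicit vector of file paths instead of a directory
dt2 <- data.table_fread_mult(filepaths = c("file1.csv", "file2.csv"))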
Fraenum answered 21/12, 2020 at 15:56 Comment(2)
How can I add a column with file names using this function? For example, if I have sample1.txt, sample2.txt and sample3.txt in my dir, I would like to read and merge them into one data table with a column holding the file name (e.g. sample1), so each row shows the file's content alongside its source file name.Disentitle
This is often still too slow. Here are some 25-50x faster ways https://mcmap.net/q/98510/-how-to-import-multiple-csv-files-at-onceMimetic