Read in data in chunks via fread in R

I am trying to read a huge dataset (> 25 GB) in R. It doesn't fit in my PC's memory, but I think it might be possible to work with the data if it were stored in compressed (RData) format, so I want to read it in chunks. As part of the process, I also have to change some column classes to character, because those columns contain a mix of numbers and strings. How can this be done with fread?

The first part is easy. For example, the following code reads the first 10 million rows:

part1 <- fread(filename, sep = " ", stringsAsFactors = FALSE, header = TRUE,
               nrows = 10000000, showProgress = TRUE,
               colClasses = c(AA = "character", BB = "character"))

But if I try to read the second part of the file, I always get an error. I am using the following code to skip the first 10,000,000 rows, which were already read in, and I am setting header to FALSE:

part2 <- fread(filename, sep = " ", stringsAsFactors = FALSE, header = FALSE,
               nrows = 10000000, skip = 10000000, showProgress = TRUE,
               colClasses = c(AA = "character", BB = "character"))

The error message is:

Error in fread()  :  Column name 'AA' in colClasses[[1]] not found

Note: If we set header = TRUE instead, the same error still occurs.

I cannot provide an example data set of this size, but I guess the problem is simply that the column names are missing once we use skip, so they cannot be matched when we set colClasses by name. Is there any way to do this with fread, or do I have to use other packages?
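
To make the idea concrete, here is a minimal sketch of one possible workaround, not a confirmed solution: read only the header first, then refer to the character columns by position and reattach the names with col.names. The small temporary file, the extra column CC, and the helper read_chunk are hypothetical stand-ins for illustration. The sketch also assumes the header occupies exactly the first line of the file, so each chunk skips one extra line.

```r
library(data.table)

# Hypothetical stand-in for the real file: a small space-separated sample
# whose columns AA and BB mix numbers and strings.
filename <- tempfile(fileext = ".txt")
sample_dt <- data.table(AA = c("1", "x", "3", "4", "y", "6", "7"),
                        BB = c("a", "2", "c", "4", "e", "6", "g"),
                        CC = 1:7)
fwrite(sample_dt, filename, sep = " ", quote = FALSE)

# Read the header only (nrows = 0) to recover the column names.
col_names <- names(fread(filename, sep = " ", header = TRUE, nrows = 0))

# Refer to the character columns by position, because the names are not
# available once the header line has been skipped.
char_cols <- match(c("AA", "BB"), col_names)

# Hypothetical helper: read chunk number `i` (1-based) of `chunk_size` rows.
read_chunk <- function(i, chunk_size) {
  fread(filename, sep = " ", header = FALSE,
        skip = 1 + (i - 1) * chunk_size,  # 1 header line + rows already read
        nrows = chunk_size,
        colClasses = list(character = char_cols),
        col.names = col_names)
}

part1 <- read_chunk(1, 3)
part2 <- read_chunk(2, 3)
```

With the real file, chunk_size would be 10000000, and each chunk could be saved (for example with saveRDS) and removed from memory before the next one is read.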
