Read in data in chunks via fread in R
I am trying to read a huge dataset (> 25 GB) in R. It doesn't fit in my PC's memory, but I think it might be possible to work with the data if it were stored in compressed (RData) format. As part of the process, I also have to set some column classes to character, because some of the columns contain a mix of numbers and strings. How can this be done with fread?

The first part is easy. For example, the following code reads the first 10 million rows:

part1 <- fread(filename,sep = " ", stringsAsFactors = FALSE, header = TRUE, 
      nrows = 10000000,showProgress = TRUE, 
      colClasses=c(AA="character",BB="character"))

But if I try to read the second part of the file, I always get an error. I am using the following code to skip the first 10,000,000 rows, which were already read in before. I am also setting header to FALSE:

part2 <- fread(filename,sep = " ", stringsAsFactors = FALSE, header = FALSE, 
      nrows = 10000000,skip=10000000,showProgress = TRUE, 
      colClasses=c(AA="character",BB="character"))

The error message is:

Error in fread()  :  Column name 'AA' in colClasses[[1]] not found

Note: If we set header = TRUE, the error message still occurs.

I cannot provide an example data set of this size, but I guess the problem is simply that the column names are missing when skip is used, so they cannot be matched when colClasses is set. Is there a way to do this with fread, or do I have to use another package?
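One possible workaround, sketched here under assumptions rather than given as a confirmed fix: read only the header line once (nrows = 0) to recover the column names, then read each chunk with header = FALSE, pass colClasses by column position instead of by name, and restore the names with col.names. The tiny temporary file and the helper read_chunk are hypothetical stand-ins for the real 25 GB file. Note that skip counts lines, including the header line, so each chunk skips 1 plus the number of data rows already read.

```r
library(data.table)

# Hypothetical stand-in for the real file: space-separated, with a header line
tmp <- tempfile(fileext = ".txt")
writeLines(c("AA BB CC",
             "1 x 10",
             "2 7 20",
             "AC 8 30",
             "4 y 40",
             "5 9 50"), tmp)

# Read only the header to recover the column names (no data rows)
col_names <- names(fread(tmp, sep = " ", header = TRUE, nrows = 0))

# Positions of the columns that must be read as character
char_cols <- match(c("AA", "BB"), col_names)

# Hypothetical helper: read the chunk_index-th block of chunk_size data rows
read_chunk <- function(file, chunk_index, chunk_size) {
  fread(file, sep = " ", header = FALSE,
        skip = 1 + (chunk_index - 1) * chunk_size,  # header line + rows already read
        nrows = chunk_size,
        col.names = col_names,                      # restore the original names
        colClasses = list(character = char_cols))   # classes by position, not by name
}

part1 <- read_chunk(tmp, 1, 2)
part2 <- read_chunk(tmp, 2, 2)
part3 <- read_chunk(tmp, 3, 2)  # last chunk may hold fewer rows than chunk_size
```

Because the classes are given by position, fread does not need to find 'AA' or 'BB' in a chunk that has no header line.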

Polypeptide answered 22/11, 2017 at 4:43 Comment(6)
Have you tried reading the data without changing column names? - Glennglenna
Yep, though I guess you mean changing column classes? Then I get messages like: Bumped column 12 to type character on data row 2647747, field contains 'AC'... That can actually be harmful, because all previously read values are then simply converted to character (not necessarily losslessly). - Polypeptide
Can you please run this code and check whether you get any error: part1 <- fread(filename,sep = " ", stringsAsFactors = FALSE, header = TRUE, nrows = 2647746, showProgress = TRUE, colClasses=c(AA="character",BB="character")) - Glennglenna
That is the same code as before, but with a lower nrows, right? I don't expect an error, because the code also runs with a larger nrows. But I can try. - Polypeptide
Yes, please try. I suspect something is wrong at row 2647747. - Glennglenna
No, it works fine. There is just a character value ('AC') there, whereas before it was only numbers, from what I have seen. By the way, the problem also exists if you try to use select in combination with skip in fread. - Polypeptide
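For the select case mentioned in the last comment, a similar sketch may help, again as an assumption rather than a confirmed fix: select also accepts column numbers, so the positions can be looked up from the header once and reused for every chunk. The sample file and the column choice are hypothetical.

```r
library(data.table)

# Hypothetical stand-in for the real file: space-separated, with a header line
tmp <- tempfile(fileext = ".txt")
writeLines(c("AA BB CC",
             "1 x 10",
             "2 7 20",
             "AC 8 30",
             "4 y 40",
             "5 9 50"), tmp)

# Recover the column names from the header line only
col_names <- names(fread(tmp, sep = " ", header = TRUE, nrows = 0))

# Translate the wanted column names into positions
keep <- match(c("AA", "CC"), col_names)

# Read data rows 3 and 4 (skip the header line plus 2 data rows), keeping only
# the selected columns by position
chunk <- fread(tmp, sep = " ", header = FALSE, skip = 3, nrows = 2, select = keep)

# The chunk has default names (V1, V3, ...); put the real names back
setnames(chunk, col_names[keep])
```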
