I am trying to read a huge dataset (> 25 GB) in R. It doesn't fit in my PC's memory, but I think it might be possible to work with the data if it were in compressed (RData) format, so I want to read it in parts. As part of the process, I also have to change some column classes to character, because some of the columns contain a mix of numbers and strings. How can this be done in fread?
The first part is easy. For example, the following code reads the first 10 million rows:
part1 <- fread(filename,sep = " ", stringsAsFactors = FALSE, header = TRUE,
nrows = 10000000,showProgress = TRUE,
colClasses=c(AA="character",BB="character"))
But if I try to read the second part of the file, I always get an error. I am using the following code to skip the first 10,000,000 rows, which were already read in the first step. I am also setting header to FALSE:
part2 <- fread(filename,sep = " ", stringsAsFactors = FALSE, header = FALSE,
nrows = 10000000,skip=10000000,showProgress = TRUE,
colClasses=c(AA="character",BB="character"))
The error message is:
Error in fread() : Column name 'AA' in colClasses[[1]] not found
Note: If we set header = TRUE, the error message still occurs.
I cannot provide an example data set of this size, but I guess the problem is simply that the column names are missing when we use skip, so they cannot be matched when we set colClasses. Is there any way to do this with fread, or do I have to use other packages?
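One possible workaround, as a minimal sketch rather than a definitive answer: read only the header line first to recover the column names, then address the character columns by position in colClasses and restore the names with col.names. The small temporary file and the names chunk_rows, col_names and char_cols are illustrative assumptions, not part of the original code. Note also that skip counts lines, so the header line has to be added to the number of data rows already read.

```r
library(data.table)

# Hypothetical small file standing in for the real 25 GB one.
tmp <- tempfile(fileext = ".txt")
writeLines(c("AA BB CC",
             "1 x 10",
             "2 y 20",
             "AC 3 30",
             "4 z 40"), tmp)

chunk_rows <- 2L  # illustrative chunk size (10000000 in the question)

# nrows = 0 returns an empty table that still carries the column names.
col_names <- names(fread(tmp, sep = " ", header = TRUE, nrows = 0))

# Positions of the columns that must be read as character.
char_cols <- match(c("AA", "BB"), col_names)

# First chunk: the header is present, so names could be used directly;
# positions are used here only to keep both calls symmetric.
part1 <- fread(tmp, sep = " ", header = TRUE, nrows = chunk_rows,
               colClasses = list(character = char_cols))

# Second chunk: skip the header line plus the rows already read.
# No header is visible after skipping, so classes are given by position
# and the names are restored through col.names.
part2 <- fread(tmp, sep = " ", header = FALSE, nrows = chunk_rows,
               skip = 1L + chunk_rows,
               col.names = col_names,
               colClasses = list(character = char_cols))

str(part2)
```

Later chunks would follow the same pattern with skip = 1L + k * chunk_rows for the k-th chunk (counting from zero), assuming the file has exactly one header line and no blank lines.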
Bumped column 12 to type character on data row 2647747, field contains 'AC'...
It can actually be harmful, because all previously read-in values are then simply converted to character (not necessarily lossless). – Polypeptide
part1 <- fread(filename,sep = " ", stringsAsFactors = FALSE, header = TRUE, nrows = 2647746, showProgress = TRUE, colClasses=c(AA="character",BB="character")) – Glennglenna
select in combination with skip in fread – Polypeptide