I'm trying to input a large tab-delimited file (around 2GB) using the fread function in the data.table package. However, because it's so large, it doesn't fit completely in memory. I tried to input it in chunks using the skip and nrows arguments, like this:
library(data.table)

chunk.size = 1e6
done = FALSE
chunk = 1
while (!done) {
  # skip the rows already processed, then read the next chunk
  temp = fread("myfile.txt", skip = (chunk - 1) * chunk.size, nrows = chunk.size)
  # do something to temp
  chunk = chunk + 1
  # a short chunk means we've reached the end of the file
  if (nrow(temp) < chunk.size) done = TRUE
}
In the case above, I'm reading in 1 million rows at a time, performing a calculation on them, then getting the next million, and so on. The problem is that to retrieve each chunk, fread has to start scanning from the very beginning of the file, since skip increases by a million on every loop iteration. As a result, fread takes longer and longer to reach each successive chunk, which makes this approach very inefficient.
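To illustrate the pattern (hypothetical timings on my file, just to show the trend):

library(data.table)

# Each call rescans the file from the top to honor `skip`,
# so elapsed time grows with the skip value (timings illustrative):
system.time(fread("myfile.txt", skip = 0e6, nrows = 1e6))   # fast
system.time(fread("myfile.txt", skip = 5e6, nrows = 1e6))   # slower
system.time(fread("myfile.txt", skip = 9e6, nrows = 1e6))   # slowest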
Is there a way to tell fread to pause every, say, 1 million lines, and then continue reading from that point on without having to restart at the beginning? Any solutions, or should this be a new feature request?
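I know I could fall back to a base-R connection, which remembers its position in the file between reads so each chunk costs the same, but read.table is far slower than fread on a file this size. A minimal sketch of that workaround (assuming the file has no header row, with chunk.size as above):

con = file("myfile.txt", open = "r")
repeat {
  # read.table on an open connection continues from the current position
  temp = tryCatch(
    read.table(con, sep = "\t", nrows = chunk.size, header = FALSE),
    error = function(e) NULL  # read.table errors at EOF when no lines remain
  )
  if (is.null(temp)) break
  # do something to temp
  if (nrow(temp) < chunk.size) break  # short read: last chunk
}
close(con)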