Reading in chunks at a time using fread in package data.table

I'm trying to read in a large tab-delimited file (around 2GB) using the fread function in package data.table. However, because it's so large, it doesn't fit completely in memory. I tried to read it in chunks using the skip and nrow arguments, like this:

chunk.size = 1e6
done = FALSE
chunk = 1
while(!done)
{
    temp = fread("myfile.txt",skip=(chunk-1)*chunk.size,nrow=chunk.size-1)
    #do something to temp
    chunk = chunk + 1
    if(nrow(temp)<2) done = TRUE
}

In the case above, I'm reading in 1 million rows at a time, performing a calculation on them, and then getting the next million, and so on. The problem with this code is that after every chunk is retrieved, fread has to start scanning the file from the very beginning, since skip increases by a million after each loop iteration. As a result, fread takes longer and longer to actually reach the next chunk, which makes this very inefficient.

Is there a way to tell fread to pause every say 1 million lines, and then continue reading from that point on without having to restart at the beginning? Any solutions, or should this be a new feature request?

Landmark answered 10/11, 2013 at 19:38 Comment(7)
There's a similar FR here. I'll also link to this post.Butterfield
Thanks for pointing this out and linking! Looks like a top priority FR.Landmark
I wanted to do the same thing; I think it needs to be a new request.Glorious
@Butterfield Is there a FR on the new github page? I can't seem to find itKarilla
is this possible by now?Accustomed
The other replies don't answer the Q as asked, which wants to use "fread in package data.table", presumably for its other great benefits. A reply that might arguably come closer to the OP's intent would be a solution using GNU parallel, which can chunk the data as desired, including repeating the header row for each chunk, but it would suffer from (a) the overhead of multiple R processes, (b) possibly having to re-infer the data types for each column, potentially differently between processes, and (c) having to learn GNU parallel (well worth it).Tefillin
@Landmark fread now accepts text as input, so you could either use fread or readLines to read in your chunk and use that as input to fread (again).Soliz
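A minimal sketch of that last suggestion, assuming a data.table version recent enough for fread() to have a text argument: keep a file connection open so that readLines() continues from where it stopped, and hand each batch of lines to fread() for parsing.

library(data.table)

con <- file("myfile.txt", open = "r")
header <- readLines(con, n = 1)            # keep the header so every chunk is parsed the same way
chunk.size <- 1e6

repeat {
  lines <- readLines(con, n = chunk.size)  # an open connection remembers its position between calls
  if (length(lines) == 0) break
  temp <- fread(text = c(header, lines), sep = "\t")
  # do something to temp
}
close(con)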

You should use the LaF package. It introduces a sort of pointer into your data, thus avoiding the (for very large data) annoying behaviour of reading the whole file. As far as I understand it, fread() in the data.table package needs to know the total number of rows, which takes time for GB-sized data. Using the pointer in LaF you can go to any line(s) you want, read a chunk of data that you apply your function on, and then move on to the next chunk. On my small PC I ran through a 25 GB csv file in steps of 10e6 lines and extracted the roughly 5e6 observations I needed - each 10e6-line chunk took 30 seconds.

UPDATE:

library('LaF')
huge_file <- 'C:/datasets/protein.links.v9.1.txt'

#First detect a data model for your file:
model <- detect_dm_csv(huge_file, sep=" ", header=TRUE)

Then create a connection to your file using the model:

df.laf <- laf_open(model)

Once that is done you can do all sorts of things without needing to know the size of the file, unlike with data.table. For instance, place the pointer at line 100e6 and read 1e6 lines of data from there:

goto(df.laf, 100e6)
data <- next_block(df.laf,nrows=1e6)

Now data contains 1e6 lines of your CSV file (starting from line 100e6).

You can read in chunks of data (the size depending on your memory) and keep only what you need. For example, huge_file in my example points to a file with all known protein sequences and has a size of >27 GB - way too big for my PC. To get only the human sequences I filtered on the organism id, which is 9606 for human and should appear at the start of the variable protein1. A dirty way is to put it in a simple for-loop and just read one chunk of data at a time:

library('dplyr')
library('stringr')

# empty result with the same columns as the file
res <- df.laf[1,][0,]

for(i in 1:10){
  # next_block() continues from the current position of the pointer
  raw <-
    next_block(df.laf, nrows = 100e6) %>%
    filter(str_detect(protein1, "^9606\\."))
  res <- rbind(res, raw)
}

Now res contains the filtered human data. But better - and for more complex operations, e.g. calculations on the data on the fly - the function process_blocks() takes a function as an argument. Hence in that function you can do whatever you want with each piece of data. Read the documentation.
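As a rough sketch of that idea (the callback signature and defaults here are my reading of the LaF docs, so check ?process_blocks): the supplied function receives the current block plus the result of the previous call, and process_blocks() returns whatever the last call produced. For example, counting the human (9606) rows without holding more than one block in memory:

count_human <- function(block, result) {
  # 'result' is assumed to be NULL on the first call, then whatever we returned last time;
  # a final call with an empty block just adds 0 here, which is harmless
  if (is.null(result)) result <- 0
  result + sum(grepl("^9606\\.", block$protein1))
}

n_human <- process_blocks(df.laf, count_human, nrows = 1e6)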

Dasya answered 17/12, 2014 at 21:17 Comment(2)
Yeah, please see the UPDATE.Dasya
Thanks for this. I had an 872493862 line 61GB file and it worked reasonably fast. I tried the same looping approach with fread() using "nrows" and "skip" but it became slower and slower in each loop as it had to skip more rows.Crossway

You can use readr's read_*_chunked to read in data and e.g. filter it chunkwise. See here and here for an example:

# Cars with 3 gears
f <- function(x, pos) subset(x, gear == 3)
read_csv_chunked(readr_example("mtcars.csv"), DataFrameCallback$new(f), chunk_size = 5)
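The same callback mechanism also works for chunkwise calculations rather than filtering. A sketch (assuming readr's DataFrameCallback, which row-binds whatever the callback returns for each chunk): summarise each chunk, then combine the per-chunk summaries:

library(readr)
library(dplyr)

# the callback receives the chunk (x) and the row position it starts at (pos)
count_gears <- function(x, pos) count(x, gear)

per_chunk <- read_csv_chunked(readr_example("mtcars.csv"),
                              DataFrameCallback$new(count_gears),
                              chunk_size = 10)

# per_chunk holds one small summary per chunk; aggregate them into the final result
per_chunk %>% group_by(gear) %>% summarise(n = sum(n))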
Fancher answered 2/5, 2017 at 9:53 Comment(1)
I tried this approach but for my 61GB file it was too slow.Crossway

A related option is the chunked package. Here is an example with a 3.5 GB text file:

library(chunked)
library(tidyverse)

# I want to look at the daily page views of Wikipedia articles
# before 2015... I can get zipped log files
# from here: https://dumps.wikimedia.org/other/pagecounts-ez/merged/2012/2012-12/
# I get a bz file, which I unzip to get this: 

my_file <- 'pagecounts-2012-12-14/pagecounts-2012-12-14'

# How big is my file?
print(paste(round(file.info(my_file)$size  / 2^30,3), 'gigabytes'))
# [1] "3.493 gigabytes" too big to open in Notepad++ !
# But can read with 010 Editor

# look at the top of the file 
readLines(my_file, n = 100)

# to find where the content starts, vary the skip value, 
read.table(my_file, nrows = 10, skip = 25)

This is where we start working on the file in chunks; we can use most dplyr verbs in the usual way:

# Let the chunked pkg work its magic! We only want the lines containing 
# "Gun_control". The main challenge here was identifying the column
# header
df <- 
read_chunkwise(my_file, 
               chunk_size=5000,
               skip = 30,
               format = "table",
               header = TRUE) %>% 
  filter(stringr::str_detect(De.mw.De.5.J3M1O1, "Gun_control"))

# this line does the evaluation, 
# and takes a few moments...
system.time(out <- collect(df))

And here we can work on the output as usual, since it's much smaller than the input file:

# clean up the output to separate into cols, 
# and get the number of page views as a numeric
out_df <- 
out %>% 
  separate(De.mw.De.5.J3M1O1, 
           into = str_glue("V{1:4}"),
           sep = " ") %>% 
  mutate(V3 = as.numeric(V3))

 head(out_df)
    V1                                                        V2   V3
1 en.z                                               Gun_control 7961
2 en.z Category:Gun_control_advocacy_groups_in_the_United_States 1396
3 en.z          Gun_control_policy_of_the_Clinton_Administration  223
4 en.z                            Category:Gun_control_advocates   80
5 en.z                         Gun_control_in_the_United_Kingdom   68
6 en.z                                    Gun_control_in_america   59
                                                                                 V4
1 A34B55C32D38E32F32G32H20I22J9K12L10M9N15O34P38Q37R83S197T1207U1643V1523W1528X1319
2                                     B1C5D2E1F3H3J1O1P3Q9R9S23T197U327V245W271X295
3                                     A3B2C4D2E3F3G1J3K1L1O3P2Q2R4S2T24U39V41W43X40
4                                                            D2H1M1S4T8U22V10W18X14
5                                                             B1C1S1T11U12V13W16X13
6                                                         B1H1M1N2P1S1T6U5V17W12X12

#--------------------
Apocynthion answered 1/3, 2018 at 15:1 Comment(2)
Seems chunked is a wrapper around LaF.Milliary
Yep, though more user-friendly IMOApocynthion

fread() can definitely help you read the data in chunks.

The mistake in your code is that nrow should be kept constant while the skip parameter changes on every iteration of the loop.

Something like this is what I wrote for my data:

data <- list()

for (i in 0:20){
    # nrow stays constant; skip advances by one chunk each iteration
    data[[i+1]] <- fread("my_data.csv", nrow = 10000, select = c(1, 2:100), skip = 10000*i)
}

And you may insert the follow code in your loop:

start_time <- Sys.time()
#####something!!!!

end_time <- Sys.time()

end_time - start_time

to check the time and confirm that each iteration takes roughly the same amount of time on average.

Then you could use another loop to combine the chunks by rows with R's default rbind function.

The sample code could be something like this:

new_data = data[[1]]

for (i in 1:20){
    new_data=rbind(new_data,data[[i+1]],use.names=FALSE)
}

to unify into a large dataset.
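Since the chunks already sit in a list, data.table's rbindlist() can also combine them in a single call, which avoids growing the result inside a loop:

# one-shot alternative to the rbind loop above
new_data <- data.table::rbindlist(data, use.names = FALSE)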

Hope my answer may help with your question.

I loaded an 18 GB dataset with 2k+ columns and 200k rows in about 8 minutes using this method.

Crosscheck answered 5/2, 2020 at 23:26 Comment(1)
you are my hero, I used the argument rnows but it was nrowForm
