How to append several large data.table objects into a single data.table and export to csv quickly without running out of memory?
The easy answer to this is "buy more RAM" but I am hoping to get a more constructive answer and learn something in the process.

I am running Windows 7 64-bit with 8GB of RAM.

I have several very large .csv.gz files (~450MB uncompressed each), all with exactly the same header, that I read into R and process. Then I need to combine the processed R objects into a single master object and write it back out to .csv on disk.

I do this same operation on multiple sets of files. As an example, I have 5 folders each with 6 csv.gz files in them. I need to end up with 5 master files, one for each folder.

My code looks something like the following:

for( loop through folders ){
    master.file = data.table()

    for ( loop through files ) {
        filename = list.files( ... )
        file = as.data.table ( read.csv( gzfile( filename ), stringsAsFactors = F ))
        gc()

        ...do some processing to file...

        # append file to the running master.file
        if ( nrow(master.file) == 0 ) {
            master.file = file
        } else {
            master.file = rbindlist( list( master.file, file) )
        }
        rm( file, filename )
        gc()
    }

    write.csv( master.file, unique master filename, row.names = FALSE )

    rm( master.file )
    gc()

}

This code does not work. I get a "cannot allocate memory" error before it writes out the final csv. I was watching Resource Monitor while running this code and don't understand why it would need 8GB of RAM to do this processing. The total of all the file sizes is roughly 2.7GB, so I was expecting that the maximum memory R would use is 2.7GB. But the write.csv operation seems to use about as much memory as the object being written, so with a 2.7GB object in memory, writing it out uses roughly 5.4GB (2.7GB + 2.7GB).

That, combined with a for loop in which memory doesn't seem to be getting adequately freed, appears to be the problem.

I suspect that I could use the sqldf package as mentioned here and here, but when I assigned the result of the sqldf statement to an R variable I ended up with the same out-of-memory errors.
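
For reference, the kind of sqldf call I was attempting looks roughly like the sketch below (the file name is a placeholder, and the .csv.gz files would need to be decompressed first):

# Rough sketch of the sqldf approach; read.csv.sql() stages the file in a
# temporary SQLite database, but the result still comes back as an
# in-memory data frame, which matches the out-of-memory errors I saw.
library(sqldf)
file = read.csv.sql( "some_file.csv", sql = "select * from file" )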

Yardarm answered 20/12, 2013 at 22:34 Comment(3)
The general rule is that you should have 3x the size of your largest object. (So you've already violated that rule.) Furthermore, you may have 8GB of RAM, but you need to subtract the RAM used by the OS and other applications and background utilities.Piefer
Why don't you use write.table(yourFirstDataTable, sep = ",", file = "YourFile.csv") for the first file read in and processed, and then write.table(yourOtherDataTables, sep = ",", file = "YourFile.csv", append = TRUE, col.names = FALSE)?Sparke
@AnandaMahto that is an excellent suggestion! I forgot that there was an append option to write.table. write.csv disables that option but with write.table, I don't have to append the data in R, I can just append each new object to the file on disk.Yardarm
Update 12/23/2013 - The following solution works entirely in R without running out of memory (thanks @AnandaMahto).
The major caveat with this method is that you must be absolutely sure that the files you are reading in and writing out each time have exactly the same header columns, in exactly the same order, or your R processing code must ensure this, since write.table does not check it for you (a sketch of such a check follows the loop below).

for( loop through folders ){

    for ( loop through files ) {

        filename = list.files( ... )
        file = as.data.table ( read.csv( gzfile( filename ), stringsAsFactors = F ))
        gc()

        ...do some processing to file...

        # append file to the running master.file
        if ( first time through inner loop) {
            write.table(file, 
                        "masterfile.csv", 
                        sep = ",", 
                        dec = ".", 
                        qmethod = "double", 
                        row.names = FALSE)
        } else {
            write.table(file,
                        "masterfile.csv",
                        sep = ",",
                        dec = ".",
                        qmethod = "double",
                        row.names = FALSE,
                        append = TRUE,
                        col.names = FALSE)
        }
        rm( file, filename )
        gc()
    }
    gc()
}
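
To guard against the header caveat above, a check along these lines could be inserted just before the write.table() calls (a minimal sketch; reference.columns is a hypothetical variable, set to NULL at the start of each folder's inner loop):

# Sketch of a header-consistency check; 'file' and 'filename' are the loop
# variables above, and reference.columns starts out as NULL for each folder.
if ( is.null( reference.columns ) ) {
    reference.columns = names( file )   # remember the first file's header
} else if ( !identical( names( file ), reference.columns ) ) {
    stop( "Column names in ", filename, " do not match the first file" )
}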

My Initial Solution:

for( loop through folders ){

    for ( loop through files ) {
        filename = list.files( ... )
        file = as.data.table ( read.csv( gzfile( filename ), stringsAsFactors = F ))
        gc()

        ...do some processing to file...

        #write out the file
        write.csv( file, ... )
        rm( file, filename )
        gc()
    }        
    gc()
}

I then downloaded and installed GnuWin32's sed package and used Windows command line tools to append the files as follows:

copy /b *common_pattern*.csv master_file.csv

This appends together all of the individual .csv files whose names have the text pattern "common_pattern" in them, headers and all.

Then I use sed.exe to remove all but the first header line as follows:

"c:\Program Files (x86)\GnuWin32\bin\sed.exe" -i 2,${/header_pattern/d;} master_file.csv

-i tells sed to overwrite the specified file in place.

2,$ tells sed to look at the range from the 2nd row to the last row ($).

{/header_pattern/d;} tells sed to find all lines in that range containing the text "header_pattern" and delete (d) them.

To make sure this was doing what I wanted, I first printed the lines it would delete:

"c:\Program Files (x86)\GnuWin32\bin\sed.exe" -n 2,${/header_pattern/p;} master_file.csv

Works like a charm; I just wish I could do it all in R.
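
For what it's worth, a rough all-R equivalent of the copy/sed step is to stream each per-folder .csv through a file connection and drop the header on all but the first file. A sketch, assuming the same placeholder file pattern and output name as above:

# Concatenate the per-folder .csv files in R, keeping only the first header.
# The pattern and the output file name are placeholders.
csv.files = list.files( pattern = "common_pattern.*\\.csv$", full.names = TRUE )
out = file( "master_file.csv", open = "w" )
for ( i in seq_along( csv.files ) ) {
    lines = readLines( csv.files[ i ] )
    if ( i > 1 ) lines = lines[ -1 ]   # drop the header line
    writeLines( lines, out )
}
close( out )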

Yardarm answered 20/12, 2013 at 22:34 Comment(23)
I only skimmed the question and answer so far ... how about fread?Baler
I looked up fread and it had a note that it "wasn't for production use yet" so I passed on it.Yardarm
fread() works fine in my experience. Make sure you have virtual memory turned on with something like memory.limit(size = 6 * 8192). This is slow but works. (And add 'buy an SSD' to your 'buy RAM' list ;)).Graben
FYI, calling gc() explicitly achieves nothingDowzall
@Graben How much RAM do you have and what size file are you loading? I suspect it's quite different to the numbers in the question but your comment is quite general about fread?Baler
@BrianD I'd try fread in this case. Nothing to lose. The note about non-production use is to do with building code dependencies in your code (arguments may still possibly change in future). It isn't unstable in the non-working sense.Baler
@Dowzall 'Achieves nothing' seems a little strong. Never, ever? How have you measured what gc() has done? You refer in that article 'despite what you have read elsewhere' but don't go into detail - are they less intelligent than you or simply all mistaken? There are different levels of garbage collection I believe, a deeper one is triggered every 20 (iirc) - were you aware of that? Are you absolutely sure about what you write?Baler
@MattDowle My comment about fread() was that it has always worked fine for me, and is not linked to the amount of RAM. My problem was similar to that in the question, just a bit larger (25 tab-separated files, 220-450MB uncompressed, ~ 9GB total, also on Win 7 64-bit, on an Intel Core 2 Quad Q6600 with 8GB RAM). In his case probably 6*RAM is overkill by far (I just copy&pasted, and I need more RAM because I was reshaping and doing other stuff after the importing). I did everything in R as in my case the files had different column order.Graben
@Dowzall I do not know how gc() works exactly (will read what you linked to) but my informal impression was that it was never triggered automatically. Code where I used it manually after every RAM-intensive operation did seem to work faster. Maybe because of the virtual memory I had to use? Too bad I moved to a machine with 32GB of RAM and cannot provide exact tests to support this...Graben
@Graben gc() is always triggered automatically when needed. Otherwise R would run out of memory very very quickly. If you're seeing any differences, it's probably the placebo effect.Dowzall
@MattDowle in this case I think the burden of proof is on the accuser: you need to provide evidence that R's garbage collector is broken and needs to be run manually.Dowzall
@Dowzall The manual ?gc says that it triggers a collect and also that it can be useful to call manually. Are you saying the manual is wrong? If so, how do you suggest it be improved?Baler
@Graben If you were loading 9GB into 8GB then of course you'll have been swapping and that's what led you to recommend SSD. That's nothing to do with fread. R runs within RAM. If you go larger than RAM then you'll swap and that's slow. Your original comment seems to be about R (and any in-RAM software), nothing to do with fread. fread is intended to load data into RAM. It doesn't help for data larger than RAM. RAM is very cheap; e.g. some data.table users have 512GB of RAM (0.5TB). Glad you have now quadrupled to 32GB.Baler
@MattDowle Yes, of course you are absolutely correct, and I am aware of what you write. The 'slow' part referred to using virtual memory, not to fread. Maybe I should have posted the virtual memory part under the question, not under the solution. Unfortunately I cannot edit, just delete it now (and will do so if you want me to). data.table has saved me a lot of time, a big "thank you" to you! The SSD recommendation was an intermediate solution, my old PC already had the max supported 8GB of RAM, and it was cheaper to get an SSD than a whole new PC (which I eventually could afford/justify).Graben
@MattDowle It also says "This will also take place automatically without user intervention, and the primary purpose of calling gc is for the report on memory usage", and calling it only "may prompt R to return memory to the operating system" (which is a pretty weak claim). So I think the help is balanced.Dowzall
@Dowzall This seems to run noticeably faster with manual gc(). Just tested on a virtual machine (Win XP 32 bit, 512MB RAM + 3GB swap file). The number of lines fread reads is important to generate the different times. I do not know whether due to R- or to OS-specific stuff. Could be I am doing something wrong (the code is far from pretty and efficient) but does not look like placebo. Maybe suboptimal threshold to run gc() in a low-RAM, high-swap-file scenario?Graben
@Dowzall If you're claiming that the manual for ?gc makes a weak claim then it's for you to show that. From what I can see, you've just agreed that calling gc() may do something sometimes. This is all that's needed to counter your (strong) statement that "calling gc() explicitly achieves nothing". Since you're a scientist I'm challenging your scientific reasoning in this case. You haven't answered my question as to whether you were aware of the different levels of garbage collection and that a deeper collect happens every 20 (iirc). Do you have any support from anyone else for your claim?Baler
@Graben Interesting. In R internals the "large vector heap" is different to small vectors (iirc something like 4k bytes, but don't quote me). I know very little about this area but this may explain why you noticed it depends on the number of rows in the file; i.e., so that vectors (the columns) are big enough to be allocated on the large vector heap. Further settings to fine tune the garbage collection trigger point are in ?Memory.Baler
@BrianD To avoid copy&paste of code (the write.table() part is almost identical) you can define a variable if(first time through inner loop) append.to.file<-TRUE else append.to.file<-FALSE and call write.table(append=append.to.file,...). Easier to read, and also easier to maintain if you decide to change the output format in the future.Graben
@Dowzall for some of my R code, I use an explicit call to gc to make sure R doesn't hog up resources (for other processes to run smoothly in parallel with R). This is a different question from "does calling gc add value when you run R and only R", but is still an important use of gc for my needs.Aforesaid
@Aforesaid that's a valid reason, but I believe will only work on some operating systems.Dowzall
@Dowzall it works on both linux and windows for me (a quick test is to check R memory usage after creating and deleting a large object and running gc afterwards)Aforesaid
Consider using the new fwrite; see item 31 in the news for 1.9.7 and the installation instructions for the latest version.Mastic
