Trimming a huge (3.5 GB) csv file to read into R

So I've got a data file (semicolon-separated) that has a lot of detail and incomplete rows (leading Access and SQL to choke). It's a county-level data set broken into segments, sub-segments, and sub-sub-segments (for a total of ~200 factors) for 40 years. In short, it's huge, and it's not going to fit into memory if I try to simply read it.

So my question is this: given that I want all the counties, but only a single year (and just the highest level of segment... leading to about 100,000 rows in the end), what would be the best way to go about getting this rollup into R?

Currently I'm trying to chop out irrelevant years with Python, getting around the filesize limit by reading and operating on one line at a time, but I'd prefer an R-only solution (CRAN packages OK). Is there a similar way to read in files a piece at a time in R?

Any ideas would be greatly appreciated.

Update:

  • Constraints
    • Needs to use my machine, so no EC2 instances
    • As R-only as possible. Speed and resources are not concerns in this case... provided my machine doesn't explode...
    • As you can see below, the data contains mixed types, which I need to operate on later
  • Data
    • The data is 3.5GB, with about 8.5 million rows and 17 columns
    • A couple thousand rows (~2k) are malformed, with only one column instead of 17
      • These are entirely unimportant and can be dropped
    • I only need ~100,000 rows out of this file (See below)

Data example:

County; State; Year; Quarter; Segment; Sub-Segment; Sub-Sub-Segment; GDP; ...
Ada County;NC;2009;4;FIRE;Financial;Banks;80.1; ...
Ada County;NC;2010;1;FIRE;Financial;Banks;82.5; ...
NC  [Malformed row]
[8.5 Mill rows]

I want to chop out some columns and pick two out of 40 available years (2009-2010 from 1980-2020), so that the data can fit into R:

County; State; Year; Quarter; Segment; GDP; ...
Ada County;NC;2009;4;FIRE;80.1; ...
Ada County;NC;2010;1;FIRE;82.5; ...
[~200,000 rows]

Results:

After tinkering with all the suggestions made, I decided that readLines, suggested by JD and Marek, would work best. I gave Marek the check because he gave a sample implementation.

I've reproduced a slightly adapted version of Marek's implementation for my final answer here, using strsplit and cat to keep only the columns I want.

It should also be noted this is MUCH less efficient than Python... as in, Python chomps through the 3.5GB file in 5 minutes while R takes about 60... but if all you have is R then this is the ticket.

## Open a connection separately to hold the cursor position
file.in <- file('bad_data.txt', 'rt')
file.out <- file('chopped_data.txt', 'wt')
line <- readLines(file.in, n=1)
line.split <- strsplit(line, ';')
# Stitching together only the columns we want
cat(line.split[[1]][1:5], line.split[[1]][8], sep = ';', file = file.out, fill = TRUE)
## Use a loop to read in the rest of the lines
line <- readLines(file.in, n=1)
while (length(line)) {
  line.split <- strsplit(line, ';')
  # Skip malformed single-column rows; keep only the 2009 rows
  if (length(line.split[[1]]) > 1) {
    if (line.split[[1]][3] == '2009') {
      cat(line.split[[1]][1:5], line.split[[1]][8], sep = ';', file = file.out, fill = TRUE)
    }
  }
  line <- readLines(file.in, n=1)
}
close(file.in)
close(file.out)

Failings by Approach:

  • sqldf
    • This is definitely what I'll use for this type of problem in the future if the data is well-formed. However, if it's not, then SQLite chokes.
  • MapReduce
    • To be honest, the docs intimidated me on this one a bit, so I didn't get around to trying it. It looked like it required the object to be in memory as well, which would defeat the point if that were the case.
  • bigmemory
    • This approach cleanly linked to the data, but it can only handle one type at a time. As a result, all my character vectors were dropped when put into a big.matrix. If I need to design large data sets in the future, though, I'd consider using only numbers just to keep this option alive.
  • scan
    • scan() seemed to have the same type issues as bigmemory, but with all the mechanics of readLines. In short, it just didn't fit the bill this time.
Byerly answered 22/6, 2010 at 16:0 Comment(7)
If your criteria are simple enough, you can probably get away with using sed and/or awk to create a chopped-down version of the CSV that you can read in directly. Since this is more of a workaround than an answer, I'll leave it as a comment.Baptize
I agree with Hank - you should use the right tool for the job, and if it's simple data cleaning/removing irrelevant rows/columns, command-line stream tools like sort/sed/awk are great and will be way less resource-intensive than R or Python. If you give a sample of your file's format, we could probably give an exampleUncourtly
Great. Let us know what you discover.Leeland
@Hank & Aaron: I'm generally all for using the right tool for the job, but given this is on a Windows machine at work and I'm learning R as I go, I figured it would be a good exercise to forego best practices and try this as R-only if possible.Byerly
It could be sped up by reading more than one line at a time (like in my code) and by some other improvements to your code (change strsplit to scan). I wouldn't be surprised if R, after those modifications, did this in 5 minutes.Bagwell
Regarding your comment about the file being malformed you might want to look at the end of FAQ #14 on the sqldf home page. That FAQ is about NA fields but also discusses other situations of where the file is malformed: code.google.com/p/sqldf/…Ancipital
For future reference, check out the data.table R package. The fread function is much faster than read.table. Use something like x = fread(file_path_here, data.table=FALSE) to load it in as a data.frame object.Penza
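A minimal sketch of the fread approach mentioned in that last comment (the file name comes from the question; whether fill = TRUE handles the one-column rows gracefully is an assumption worth checking on a small sample first):

library(data.table)

## Read only columns 1-5 and 8 (the ones kept in the question);
## fill = TRUE pads the malformed short rows with NA instead of erroring
x <- fread("bad_data.txt", sep = ";", select = c(1:5, 8),
           fill = TRUE, data.table = FALSE)
x <- x[x[[3]] %in% c(2009, 2010), ]   # column 3 is Year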

My try with readLines. This piece of code creates a CSV with the selected years.

file_in <- file("in.csv", "r")
file_out <- file("out.csv", "a")
x <- readLines(file_in, n = 1)
writeLines(x, file_out) # copy headers

B <- 300000 # lines per batch; depends on how large one pack can be
while (length(x)) {
    # keep rows whose third field (Year) is 2009 or 2010; space after ';' is optional
    ind <- grep("^[^;]*;[^;]*; ?20(09|10)", x)
    if (length(ind)) writeLines(x[ind], file_out)
    x <- readLines(file_in, n = B)
}
close(file_in)
close(file_out)
Bagwell answered 24/6, 2010 at 8:45 Comment(1)
This is almost exactly what I was just writing. I sense this will also be the best answer, given memory constraints, mixed types, and malformed rows.Byerly

I'm not an expert at this, but you might consider trying MapReduce, which would basically mean taking a "divide and conquer" approach. R has several options for this, including:

  1. mapReduce (pure R)
  2. RHIPE (which uses Hadoop); see example 6.2.2 in the documentation for an example of subsetting files

Alternatively, R provides several packages for dealing with data too large to fit in memory (they work from disk instead). You could probably load the whole dataset into a bigmemory object and do the reduction completely within R. See http://www.bigmemory.org/ for a set of tools to handle this.
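For reference, a rough sketch of that bigmemory route, assuming the character columns can be sacrificed or recoded as numbers (the limitation noted in the question's update); the file name follows the example data:

library(bigmemory)

## File-backed matrix: the data stays on disk, only the pieces you touch come into RAM
big <- read.big.matrix("bad_data.txt", sep = ";", header = TRUE,
                       type = "double",
                       backingfile = "bad_data.bin",
                       descriptorfile = "bad_data.desc")
rows <- which(big[, 3] %in% c(2009, 2010))   # column 3 is Year; it alone fits in memory easily
small <- big[rows, ]                         # pull just the matching rows into RAM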

Leeland answered 22/6, 2010 at 16:14 Comment(2)
Good suggestion, but I don't have much experience with MapReduce and its ilk. I'll have to read up on it.Byerly
bigmemory may be easier for you to try first, in that case.Leeland

Is there a similar way to read in files a piece at a time in R?

Yes. The readChar() function will read in a block of characters without assuming they are null-terminated. If you want to read data in a line at a time you can use readLines(). If you read a block or a line, do an operation, then write the data out, you can avoid the memory issue. Although if you feel like firing up a big memory instance on Amazon's EC2 you can get up to 64GB of RAM. That should hold your file plus plenty of room to manipulate the data.

If you need more speed, then Shane's recommendation to use Map Reduce is a very good one. However if you go the route of using a big memory instance on EC2 you should look at the multicore package for using all cores on a machine.

If you find yourself wanting to read many gigs of delimited data into R, you should at least research the sqldf package, which allows you to import the data directly into SQLite from R and then operate on it from within R. I've found sqldf to be one of the fastest ways to import gigs of data into R, as mentioned in this previous question.
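A hedged sketch of that sqldf route: read.csv.sql exposes the file to SQLite as a table named file, so the subsetting can be pushed into SQL. The column names follow the example header in the question (they carry stray spaces there, so check and quote them if needed), and, as the question's update notes, the malformed rows may still trip the import up:

library(sqldf)

small <- read.csv.sql("bad_data.txt",
                      sql = "select County, State, Year, Quarter, Segment, GDP
                             from file where Year in (2009, 2010)",
                      header = TRUE, sep = ";")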

Conditioner answered 22/6, 2010 at 18:47 Comment(1)
I'll keep an EC2 instance in mind, but at the moment I've got to stick to my desktop and its 2GB of RAM. sqldf definitely seems like what I had in mind. However, it chokes on the malformed rows as well (there should be 17 columns, but a couple thousand rows only have one). Does that call for some other preprocessing method, or is there an option I'm missing?Byerly

There's a brand-new package called colbycol that lets you read in only the variables you want from enormous text files:

http://colbycol.r-forge.r-project.org/

It passes any arguments along to read.table, so the combination should let you subset pretty tightly.

Tourney answered 4/5, 2012 at 12:22 Comment(0)

The ff package is a transparent way to deal with huge files.

Take a look at the package website and/or a presentation about it.
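As a rough, untested sketch (read.table.ffdf passes its extra arguments through to read.table; the row-subsetting step and the clean Year column name are assumptions about how this particular header comes through):

library(ff)

big <- read.table.ffdf(file = "bad_data.txt", sep = ";", header = TRUE,
                       next.rows = 100000)        # read in chunks, store the result on disk
keep <- which(big$Year[] %in% c(2009, 2010))      # materialise only the Year column in RAM
small <- big[keep, ]                              # pull the selected rows into memory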

I hope this helps

Hightower answered 9/9, 2012 at 14:37 Comment(0)

What about using readr and the read_*_chunked family?

So in your case:

testfile.csv

County; State; Year; Quarter; Segment; Sub-Segment; Sub-Sub-Segment; GDP
Ada County;NC;2009;4;FIRE;Financial;Banks;80.1
Ada County;NC;2010;1;FIRE;Financial;Banks;82.5
lol
Ada County;NC;2013;1;FIRE;Financial;Banks;82.5

Actual code

require(readr)
f <- function(x, pos) subset(x, Year %in% c(2009, 2010))
read_csv2_chunked("testfile.csv", DataFrameCallback$new(f), chunk_size = 1)

This applies f to each chunk, remembering the column names and combining the filtered results in the end. See ?callback, which is the source of this example.

This results in:

# A tibble: 2 × 8
      County State  Year Quarter Segment `Sub-Segment` `Sub-Sub-Segment`   GDP
*      <chr> <chr> <int>   <int>   <chr>         <chr>             <chr> <dbl>
1 Ada County    NC  2009       4    FIRE     Financial             Banks   801
2 Ada County    NC  2010       1    FIRE     Financial             Banks   825

You can even increase chunk_size but in this example there are only 4 lines.

Oletta answered 2/5, 2017 at 9:50 Comment(0)

You could import data to SQLite database and then use RSQLite to select subsets.
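A hedged sketch of that idea, streaming the file into an on-disk SQLite table in manageable chunks so the malformed rows can be dropped in R on the way in (the file, table, and column names are taken from the example and are otherwise assumptions):

library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "county.sqlite")
file.in <- file("bad_data.txt", "rt")
header <- trimws(strsplit(readLines(file.in, n = 1), ";")[[1]])

repeat {
  lines <- readLines(file.in, n = 100000)
  if (!length(lines)) break
  parts <- strsplit(lines, ";")
  parts <- parts[lengths(parts) == length(header)]   # drop the malformed short rows
  if (!length(parts)) next
  chunk <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
  names(chunk) <- header
  dbWriteTable(con, "gdp", chunk, append = dbExistsTable(con, "gdp"))
}
close(file.in)

small <- dbGetQuery(con, "SELECT County, State, Year, Quarter, Segment, GDP
                          FROM gdp WHERE Year IN ('2009', '2010')")
dbDisconnect(con)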

Bagwell answered 22/6, 2010 at 16:36 Comment(1)
A good plan, but since this is essentially what sqldf does behind the scenes, I'd prefer that. Unless there's a better way to handle the malformed rows if you use straight RSQLite?Byerly

Have you considered bigmemory? Check out this and this.

Divorce answered 22/6, 2010 at 16:30 Comment(0)

Perhaps you could migrate to MySQL or PostgreSQL to free yourself from MS Access's limitations.

It is quite easy to connect R to these systems with a DBI-based database connector (available on CRAN).
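A small sketch of what that looks like once the data has been loaded, using the current RPostgres driver as an example (the connection details and the table name gdp are placeholders); the point is that the WHERE clause runs server-side, so only the wanted rows ever reach R:

library(DBI)

con <- dbConnect(RPostgres::Postgres(), dbname = "county",
                 host = "localhost", user = "me", password = "...")
small <- dbGetQuery(con, "SELECT county, state, year, quarter, segment, gdp
                          FROM gdp WHERE year IN (2009, 2010)")
dbDisconnect(con)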

Adrianadriana answered 22/6, 2010 at 16:36 Comment(1)
Touché for suggesting better database tools, but since that would involve an administrative hassle (gotta love those administration regulations in big companies), I'm trying to stick with what I have. Plus, I'm aiming for as few conversions of the text file I receive as possible.Byerly

scan() has both an nlines argument and a skip argument. Is there some reason you can't just use those to read in a chunk of lines at a time, checking the date to see if it's appropriate? If the input file is ordered by date, you can store an index that tells you what your skip and nlines should be, which would speed up the process in the future.
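A rough sketch of that chunked idea, using read.table on an open connection so each pass continues where the previous one stopped instead of re-skipping from the top (column positions follow the example data; fill = TRUE absorbs the malformed short rows, which then fail the Year test and drop out):

con <- file("bad_data.txt", "rt")
invisible(readLines(con, n = 1))          # skip the header line
keep <- list()
repeat {
  chunk <- tryCatch(
    read.table(con, sep = ";", nrows = 100000, fill = TRUE,
               stringsAsFactors = FALSE),
    error = function(e) NULL)             # read.table errors once the file is exhausted
  if (is.null(chunk) || nrow(chunk) == 0) break
  keep[[length(keep) + 1]] <- chunk[chunk$V3 %in% c(2009, 2010), c(1:5, 8)]   # V3 is Year
}
close(con)
result <- do.call(rbind, keep)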

Biddable answered 23/6, 2010 at 20:38 Comment(2)
I'll check it out, but the file isn't ordered by anything helpful like date. The providers seem to think it's more important to sort by what region a given county's in. /sigh...Byerly
I think you misunderstood his proposal: read your file chunk by chunk, and extract only the rows you need from each chunk. The files does not need to be ordered.Widgeon

These days, 3.5GB just isn't really that big; I can get access to a machine with 244GB of RAM (r3.8xlarge) on the Amazon cloud for $2.80/hour. How many hours will it take you to figure out how to solve the problem using big-data-type solutions? How much is your time worth? Yes, it will take you an hour or two to figure out how to use AWS, but you can learn the basics on a free tier, upload the data, and read the first 10k lines into R to check that it worked; then you can fire up a big-memory instance like r3.8xlarge and read it all in! Just my 2c.

Feck answered 12/10, 2014 at 16:20 Comment(1)
In December 2020, it's possible to get a 24 TB machine on AWS EC2! These days I regularly read in 20GB CSV files on my server, and it takes a while :). Thank goodness for the tidyverse!Feck

Now, in 2017, I would suggest going for Spark and SparkR.

  • the syntax can be written in a simple, rather dplyr-like way

  • it fits quite well with small memory (small in the 2017 sense)

However, it may be an intimidating experience to get started...
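For a concrete starting point, a sketch using sparklyr (mentioned in the comment below) rather than SparkR itself; the local master, the file path, and clean column names are assumptions:

library(sparklyr)
library(dplyr)

sc  <- spark_connect(master = "local")
gdp <- spark_read_csv(sc, name = "gdp", path = "bad_data.txt",
                      delimiter = ";", header = TRUE)
small <- gdp %>%
  filter(Year %in% c(2009, 2010)) %>%
  select(County, State, Year, Quarter, Segment, GDP) %>%
  collect()                                # only the filtered subset is pulled into R
spark_disconnect(sc)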

Weekley answered 26/9, 2017 at 3:39 Comment(1)
(The sparklyr package makes it possible to use Spark without knowing much about its syntax.)Slaver

I would go for a DB and then make some queries to extract the samples you need via DBI.

Please avoid importing a 3.5 GB CSV file into SQLite. Or at least double-check that your HUGE DB fits within SQLite's limits: http://www.sqlite.org/limits.html

It's a damn big DB you have. I would go for MySQL if you need speed. But be prepared to wait a lot of hours for the import to finish. Unless you have some unconventional hardware or you are writing from the future...

Amazon's EC2 could be a good solution also for instantiating a server running R and MySQL.

My two humble pennies' worth.

Presumably answered 22/6, 2010 at 21:14 Comment(1)
How is 3.5GB large for SQLite? As long as you are using an appropriate filesystem, there should be no problem (I regularly use >30GB SQLite DBs for single-user applications)Uncourtly
