How to deal with a 50 GB CSV file in R?

I am relatively new to handling large data in R, and I am hoping for some advice on how to deal with a 50 GB CSV file. The current problem is the following:

The table looks like this:

ID,Address,City,States,... (50 more fields of characteristics of a house)
1,1,1st street,Chicago,IL,...
# the leading 1 is the row-index column that write.csv added to the file

I would like to find all rows belonging to San Francisco, CA. It should be an easy problem, but the CSV is too large to load.

I know of two ways to do it in R, plus one more that uses a database to handle it:

(1) Using R's ff package (read.csv.ffdf):

The file was last saved with write.csv, so it contains columns of all different types.

library(ff)

all <- read.csv.ffdf(
  file = "<path of large file>",
  sep = ",",
  header = TRUE,
  VERBOSE = TRUE,
  first.rows = 10000,
  next.rows = 50000
)

The console gives me this:

Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered,  
: vmode 'character' not implemented

Searching online, I found several answers that did not fit my case, and I could not figure out how to convert the "character" columns into "factor" columns as they suggested.

Then I tried read.table.ffdf, which was even more of a disaster; I could not find a solid guide for it either.
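One thing I have not tried yet is forcing the text columns to factors through colClasses, which those answers hinted at. Roughly something like this (the column counts are only placeholders for my 50+ fields):

library(ff)

# Placeholder column types: the write.csv index column, the ID, and then
# the remaining text fields read as factors instead of characters.
all <- read.csv.ffdf(
  file = "<path of large file>",
  header = TRUE,
  colClasses = c("integer", "integer", rep("factor", 52)),  # adjust to the real columns
  first.rows = 10000,
  next.rows = 50000,
  VERBOSE = TRUE
)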

(2) Using R's readLines:

I know this is another good way, but I can't find an efficient way to do it.
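What I had in mind is roughly the following chunked filter, but I am not sure it is the right or fastest approach (the chunk size and the search pattern are just placeholders):

# Stream the file in chunks and keep only lines mentioning San Francisco.
# Note: this matches the string anywhere on the line, not only in the City field.
input  <- file("<path of large file>", open = "r")
header <- readLines(input, n = 1)
chunks <- list()

repeat {
  lines <- readLines(input, n = 100000)
  if (length(lines) == 0) break
  chunks[[length(chunks) + 1]] <- lines[grepl("San Francisco", lines, fixed = TRUE)]
}
close(input)

# Re-parse only the matching rows into a normal data frame.
sf <- read.csv(text = c(header, unlist(chunks)), stringsAsFactors = FALSE)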

(3) Using SQL:

I am not sure how to load the file into a SQL database, or how to handle it from there; if there is a good guide, I would like to try it. In general, though, I would prefer to stick with R.

Thanks for any replies and help!

Celery asked 24/9, 2016 at 17:13 Comment(4)
See if this answer helps. - Sherrell
In a very similar vein to the link above: #26862451 - Track
I'll check it out, thanks! - Celery
Possible duplicate of: How do I read only lines that fulfil a condition from a csv into R? - Hooligan

You can use R with SQLite behind the scenes via the sqldf package. Its read.csv.sql function lets you query the file with SQL and get back only the smaller data frame you need.

The example from the docs:

library(sqldf)

iris2 <- read.csv.sql("iris.csv", 
    sql = "select * from file where Species = 'setosa' ")

I've used this library on VERY large CSV files with good results.
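Adapted to the file from the question, it would look roughly like this (City and States are the column names from the sample header, so adjust them to the real header; inside the SQL statement the CSV is always referred to as file):

library(sqldf)

# Rough adaptation to the question's file; only the matching rows come back into R.
sf <- read.csv.sql("<path of large file>",
    sql = "select * from file where City = 'San Francisco' and States = 'CA'")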

Larson answered 24/9, 2016 at 17:23 Comment(4)
Thanks, I'll check it out! The problem I am having also involves the extra index column in the big data file; can read.csv.sql handle it? :D - Celery
The problem I am having also involves the extra index column in the big data file; is there a way I can fix that? Right now it tells me that from the second row onward there is one more field than in the first row. - Celery
@Celery I'm not sure I follow exactly, but it sounds like your CSV file may not have a consistent number of columns per record. You'll have to experiment with the data. You may want to delete the problem row, but if your CSV data isn't quoted and you have commas in your address data or other fields, you'll continue to have problems and may need to regenerate your data WITH quoted fields. Hope that helps. The read.csv.sql function has a quote parameter you can set to TRUE/FALSE depending on your data. Good luck! - Larson
Man, I don't know what the deal is, but I have a tab-separated csv and I tried to read in a single column and 2% of the rows, and it takes forever. The full file is 176M rows. Is it just the filtering condition? It's slower than reading in the entire thing... - Clapper

R -- in its basic configuration -- loads data into memory. Memory is cheap. 50 GB still is not a typical configuration (and you would need more than that to load the data in and store it). If you are really good in R, you might be able to figure out another mechanism. If you have access to a cluster, you could use some parallel version of R or Spark.

You could also load the data into a database. For the task at hand, a database is very well suited to the problem. R easily connects to almost any database. And, you might find a database very useful for what you want to do.

Or, you could just process the text file in situ. Command line tools such as awk, grep, and perl are very suitable for this task. I would recommend this approach for a one-time effort. I would recommend a database if you want to keep the data around for analytic purposes.
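For the database route from R, a rough sketch with DBI and RSQLite might look like the following. The chunk size, file path, table name, and the handling of the unnamed write.csv index column are all placeholders, and City/States are taken from the question's sample header:

library(DBI)
library(RSQLite)

con   <- dbConnect(RSQLite::SQLite(), "houses.sqlite")
input <- file("<path of large file>", open = "r")

# write.csv left the leading row-index column unnamed, so give it a name
# to keep the field counts consistent (assumption based on the question).
header <- strsplit(readLines(input, n = 1), ",", fixed = TRUE)[[1]]
header <- c("row_index", header)

# Load the file into SQLite in chunks so it never has to fit in memory.
first <- TRUE
repeat {
  lines <- readLines(input, n = 50000)
  if (length(lines) == 0) break
  chunk <- read.csv(text = lines, header = FALSE, col.names = header,
                    stringsAsFactors = FALSE)
  dbWriteTable(con, "houses", chunk, overwrite = first, append = !first)
  first <- FALSE
}
close(input)

# The filter now runs inside SQLite instead of in R's memory.
sf <- dbGetQuery(con,
  "select * from houses where City = 'San Francisco' and States = 'CA'")
dbDisconnect(con)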

Qualls answered 24/9, 2016 at 17:20 Comment(4)
I second the suggestion to use awk or grep! For single queries like this, you will be done in minutes. - Transmigrate
@GordonLinoff, your suggestion of R and Spark is a good one too. I've used that on AWS to tear through ~4 TB of binary files and it handled it with aplomb! - Larson
I get it now; it just took a while to realize this is a database-type problem, because R previously handled 10 GB data tables just fine. Thanks a lot for commenting! - Celery
readLines() (line-by-line from a connection) plus regex functions (standard or stringi) can work just as well; it keeps the analysis pipeline in R and works across platforms. awk (et al.) may be part of the 1 GB Rtools.exe download, but Windows R folk tend to have the most problems with non-R solutions. - Madonia
