I am relatively new to large-data processing in R, and I am hoping for some advice on how to deal with a 50 GB CSV file. The current problem is as follows:
The table looks like this:
ID,Address,City,States,... (50 more fields of characteristics of a house)
1,1,1st street,Chicago,IL,...
# the first 1 is the row-index column that write.csv added to the file
I would like to find all rows belonging to San Francisco, CA. It should be an easy problem, but the CSV is too large.
I know two ways of doing it in R, plus a third way that uses a database:
(1) Using R's ff package (read.csv.ffdf):
The file was last saved with write.csv, and its columns are of many different types.
all <- read.csv.ffdf(
  file = "<path of large file>",
  sep = ",",
  header = TRUE,
  VERBOSE = TRUE,
  first.rows = 10000,
  next.rows = 50000
)
The console gives me this:
Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered,
: vmode 'character' not implemented
Searching online, I found several answers, but they did not fit my case, and I can't really make sense of how to convert the "character" columns into "factor" type as they suggest.
Then I tried read.table.ffdf, which was even more of a disaster; I couldn't find a solid guide for it.
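From what I can piece together, the answers seem to mean passing colClasses so that every column is read as a factor (ff apparently stores factors but not raw character vectors). The sketch below is only my guess and is not tested on the full file; the subset() call relies on the ffbase package:

library(ff)
library(ffbase)  # provides subset() for ffdf objects, as far as I can tell

# force every column to factor so ff never sees a 'character' column
# (numeric columns become factors too with this blanket setting)
all <- read.csv.ffdf(
  file = "<path of large file>",
  header = TRUE,
  colClasses = "factor",  # or a vector giving each column's class
  VERBOSE = TRUE,
  first.rows = 10000,
  next.rows = 50000
)

# then the filter would be something like
sf <- subset(all, City == "San Francisco" & States == "CA")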
(2) Using R's readLines:
I know this is another good approach, but I can't find an efficient way to do it.
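The rough idea I have is to stream the file through a connection in chunks, keep only the matching lines, and write them to a much smaller file that then fits in memory. The chunk size, output file name, and the crude grepl() pattern below are all placeholders:

con <- file("<path of large file>", open = "r")
out <- file("sf_rows.csv", open = "w")

# copy the header line across first
writeLines(readLines(con, n = 1), out)

repeat {
  chunk <- readLines(con, n = 100000)  # lines per chunk, adjust to taste
  if (length(chunk) == 0) break
  # crude text match on the City/States fields; a safer version would
  # parse each chunk with read.csv(text = ...) and test the columns directly
  keep <- grepl(",San Francisco,CA,", chunk, fixed = TRUE)
  if (any(keep)) writeLines(chunk[keep], out)
}

close(con)
close(out)

sf <- read.csv("sf_rows.csv")

This keeps memory use flat, but it rereads the whole 50 GB every time the filter changes, which is why I am not sure it counts as efficient.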
(3) Using SQL:
I am not sure how to load the file into an SQL database or how to handle it from there; if there is a good guide, I would like to try it. In general, though, I would prefer to stick with R.
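The closest thing I have found that still stays inside R is the sqldf package: its read.csv.sql() loads the file into a temporary SQLite database behind the scenes and returns only the rows the query selects, with the input referred to as a table called "file" in the query. This is just my reading of its documentation, not something I have run on a file this size:

library(sqldf)

sf <- read.csv.sql(
  "<path of large file>",
  sql = "select * from file where City = 'San Francisco' and States = 'CA'"
)
# note: the unnamed row-index column that write.csv added may still need
# to be renamed or dropped afterwards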
Thanks for any replies and help!