Big data read subsamples R

Asked 16/8, 2012 at 13:49 Answered 17/8, 2012 at 19:57

Solved linux r awk system statistics-bootstrap

I'm most grateful for your time to read this.

I have a uber size 30GB file of 6 million records and 3000 (mostly categorical data) columns in csv format. I want to bootstrap subsamples for multinomial regression, but it's proving difficult even with my 64GB RAM in my machine and twice that swap file , the process becomes super slow and halts.

I'm thinking about generating subsample indicies in R and feeding them into a system command using sed or awk, but don't know how to do this. If someone knew of a clean way to do this using just R commands, I would be really grateful.

One problem is that I need to pick complete observations of subsamples, that is I need to have all the rows of a particular multinomial observation - they are not the same length from observation to observation. I plan to use glmnet and then some fancy transforms to get an approximation to the multinomial case. One other point is that I don't know how to choose sample size to fit around memory limits.

Appreciate your thoughts greatly.

R.version
platform       x86_64-pc-linux-gnu          
arch           x86_64                       
os             linux-gnu                    
system         x86_64, linux-gnu            
status                                      
major          2                            
minor          15.1                         
year           2012                         
month          06                           
day            22                           
svn rev        59600                        
language       R                            
version.string R version 2.15.1 (2012-06-22)
nickname       Roasted Marshmallows

Yoda

Thaumaturgy answered 16/8, 2012 at 13:49 Comment(1)

You can use the development version of data.table (1.8.7) and the function fread which will be faster than the most optimized version of read.table. data.tables will also be good for data the size you are considering – Everrs 11/2, 2013 at 2:42

I think it's an exceedingly terrible idea to use CSV as your data format for such file sizes - why not transform it into a SQLite (or "actual" database) and extract your subsets with SQL queries (using DBI/RSQLite2)?

You need to import only once, and there is no need to load the entire thing into memory because you can directly import CSV files into sqlite.

If in general you want to work with datasets that are larger than your memory, you might also want to have a look at bigmemory.

Greeneyed answered 16/8, 2012 at 14:23 Comment(1)

Thanks, I have the file in a database now. Will make subsample indicies in R and then select the relevant rows using SQL statements to pick each subsample. – Thaumaturgy 18/8, 2012 at 14:55

As themel has pointed out, R is very very slow on reading csv files.
If you have sqlite, it really is the best approach, as it appears the data mining is not just for one time, but over multiple session, in multiple ways.

Lets look at the options we have

Reading the csv to R (slow)

Doing this in R is like 20 times slow, compared to a tool written in C (on my machine)

This is very slow

read.csv( file='filename.csv' , head=TRUE , sep=",")

Convert to stata dta file beforehand and load from there

Not that great, but it should work (I have never tried it on a 30 gig file, so I can not say for sure)

Writing a program to convert csv into dta format(If you know what you are doing)

Using the resource from http://www.stata.com/help.cgi?dta and code from https://svn.r-project.org/R-packages/trunk/foreign/src/stataread.c to read and write and http://sourceforge.net/projects/libcsv/
(It has been done in the past. However I have not used it so I do not know how well it performs)

Then using the foreign package (http://cran.r-project.org/web/packages/foreign/index.html), a simple

library(foreign)
whatever <- read.dta("file.dta")

would load your data

Using mysql directly to import csv data (Hard to use, however it is not that bad if you know SQL)

From s SQL console

LOAD DATA LOCAL INFILE 'file.csv' INTO TABLE my_table 
IGNORE 1 LINES              <- If csv file contains headers
FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\' LINES TERMINATED BY '\n'

mysql -e "LOAD DATA INFILE 'ls.dat' INTO TABLE mytable1" mydatabase

Then play from the R console, using RMySQL R interface to the MySQL database http://cran.r-project.org/web/packages/RMySQL/index.html

install.packages('RMySQL')

Then play around like

mydb = dbConnect(MySQL(), user=username, password=userpass, dbname=databasename, host=host)
dbListTables(mydb)
record <- dbSendQuery(mydb, "select * from whatever")
dbClearResult(rs)
dbDisconnect(mydb)

Using R to do all the sqlite/postgreSQL/MySQL backend SQL stuff to import csv (Reccomended)

Download, from https://code.google.com/p/sqldf/ if you do not have the package
or svn checkout http://sqldf.googlecode.com/svn/trunk/ sqldf-read-only

From the R console,

install.packages("sqldf")
# shows built in data frames
data() 

# load sqldf into workspace
library(sqldf)
MyCsvFile <- file("file.csv")
Mydataframe <- sqldf("select * from MyCsvFile", dbname = "MyDatabase", file.format = list(header = TRUE, row.names = FALSE))

And off you go!

Presonally, I would recomend the library(sqldf) option :-)

Encrimson answered 17/8, 2012 at 19:57 Comment(3)

Tried sqldf, but my file has embedded delimiters, which causes errors. From the documentation, it seems that sqldf can't handle this. – Thaumaturgy 18/8, 2012 at 14:57

Yes, that is the reason I wrote about Using mysql directly to import cvs data, so you can adjust FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\' LINES TERMINATED BY '\n' to whatever embedded delimiter you are using. There is no known well defined fromat for cvs, so a bit of fiddling around might be required. If you could provide some sample data( few lines ) it would be easier for people in SO to answer :-) – Encrimson 18/8, 2012 at 23:7

Having established connections to multiple DBs using RMySQL, how can I query on tables present in different DBs? dbGetQuery seems to allow only one connection in every query. – Uno 19/11, 2013 at 14:0