Thank you for taking the time to read this.
I have a very large (30 GB) CSV file with 6 million records and 3000 columns, mostly categorical. I want to bootstrap subsamples for multinomial regression, but this is proving difficult: even with 64 GB of RAM and a swap file twice that size, the process becomes extremely slow and then halts.
I'm thinking about generating subsample indices in R and feeding them into a system command using sed or awk, but I don't know how to do this. If someone knows of a clean way to do this using only R commands, I would be really grateful.
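For illustration, here is a minimal sketch of one R-only way to pull a subsample without loading the whole file: stream it in chunks through a connection and keep only the rows whose ID is in the sampled set. It assumes the first column is an integer observation ID with no quoting; the column name `obs_id`, the helper `read_observations`, and the tiny demo file are all hypothetical stand-ins, not part of the real data.

```r
# Tiny stand-in for the 30 GB file: several rows per observation,
# keyed by a hypothetical integer first column `obs_id`.
path <- tempfile(fileext = ".csv")
demo <- data.frame(
  obs_id = c(1L, 1L, 2L, 3L, 3L, 3L, 4L, 4L),
  choice = c("a", "b", "a", "a", "b", "c", "b", "c"),
  x1     = c("u", "u", "v", "w", "w", "w", "u", "u")
)
write.csv(demo, path, row.names = FALSE, quote = FALSE)

# Stream the file chunk by chunk and keep only the lines whose first
# field is one of `ids`. Only the kept lines are ever parsed.
read_observations <- function(path, ids, chunk_lines = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1L)
  wanted <- as.character(ids)
  kept <- list()
  repeat {
    lines <- readLines(con, n = chunk_lines)
    if (length(lines) == 0L) break
    first_field <- sub(",.*$", "", lines)  # text before the first comma
    kept[[length(kept) + 1L]] <- lines[first_field %in% wanted]
  }
  read.csv(text = c(header, unlist(kept)), stringsAsFactors = FALSE)
}

set.seed(1)
ids <- sample(unique(demo$obs_id), size = 2)  # sampled observation IDs
subsample <- read_observations(path, ids, chunk_lines = 3L)
```

Because the filter matches on the observation ID rather than on row numbers, every row of a sampled observation comes along, however many rows it has. On the real file the list of unique IDs would have to come from a separate pass over the first column.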
One problem is that I need to pick complete observations for each subsample; that is, I need all the rows belonging to a particular multinomial observation, and the number of rows varies from observation to observation. I plan to use glmnet and then some fancy transforms to get an approximation to the multinomial case. I also don't know how to choose the sample size so that it fits within memory limits.
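A rough sketch of both points, under stated assumptions: the sample size is estimated by measuring a small pilot slice with `object.size` and scaling to a memory budget, and the bootstrap resamples observation IDs with replacement so that each draw brings all of its rows. The `pilot` data, the `obs_id` column, and the 8 GB budget are hypothetical. The estimate covers only the raw data frame; glmnet takes a numeric (or sparse) matrix, so dummy-coding the categorical columns will need extra room.

```r
# Hypothetical pilot slice already in memory; in practice use a few
# thousand rows so fixed per-object overhead does not dominate.
pilot <- data.frame(
  obs_id = c(1L, 1L, 2L, 3L, 3L, 3L, 4L, 4L),
  choice = c("a", "b", "a", "a", "b", "c", "b", "c"),
  stringsAsFactors = TRUE
)

# Rough sizing: bytes per row times average rows per observation.
bytes_per_row <- as.numeric(object.size(pilot)) / nrow(pilot)
rows_per_obs  <- nrow(pilot) / length(unique(pilot$obs_id))
budget_bytes  <- 8 * 1024^3  # assumed budget for the raw subsample
max_obs       <- floor(budget_bytes / (bytes_per_row * rows_per_obs))

# Bootstrap whole observations: draw IDs with replacement, then take
# every row of each drawn ID (repeated IDs contribute repeated rows).
set.seed(42)
ids        <- unique(pilot$obs_id)
boot_ids   <- sample(ids, size = length(ids), replace = TRUE)
rows_by_id <- split(seq_len(nrow(pilot)), pilot$obs_id)
boot_rows  <- unlist(rows_by_id[as.character(boot_ids)], use.names = FALSE)
boot       <- pilot[boot_rows, ]
```

Here `max_obs` is an upper bound on how many observations to draw per replicate, and a filter based on `%in%` alone would not be enough for the bootstrap, since an ID drawn twice has to appear twice.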
Appreciate your thoughts greatly.
R.version
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 2
minor 15.1
year 2012
month 06
day 22
svn rev 59600
language R
version.string R version 2.15.1 (2012-06-22)
nickname Roasted Marshmallows
Yoda
data.table (1.8.7) and the function fread, which will be faster than the most optimized version of read.table. data.tables will also be good for data the size you are considering – Everrs
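A small sketch of what that suggestion might look like, assuming a current data.table where `fread` accepts a `select` argument to read only named columns (version 1.8.7 mentioned above may differ). The file and column names are hypothetical, and the fallback to `read.csv` is only there so the snippet runs without the package installed.

```r
# Tiny stand-in file with hypothetical columns; the real file is ~30 GB.
path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(obs_id = c(1L, 1L, 2L),
             choice = c("a", "b", "a"),
             x1     = c(0.1, 0.2, 0.3)),
  path, row.names = FALSE
)

if (requireNamespace("data.table", quietly = TRUE)) {
  # select= restricts the read to the named columns, which reduces
  # memory use when only part of a very wide file is needed.
  dt <- data.table::fread(path, select = c("obs_id", "choice"))
} else {
  # Fallback without data.table: read everything, then drop columns.
  dt <- read.csv(path)[, c("obs_id", "choice")]
}
```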