Subsetting ffdf objects in R

I'm using R's ff package and I've got some ffdf objects (dimensions around 1.5M x 80) that I need to work with. I'm having some trouble getting my head around the efficient slicing/dicing operations though.

For instance, I've got two integer columns named "YEAR" and "AGE", and I want to make a table of AGE when the YEAR is 1999.

One approach is this:

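library(ff)  # also attaches bit, which provides bit() and chunk()

# Evaluate expr chunk by chunk and record the matching rows in a bit vector
ffwhich <- function(x, expr) {
  b <- bit(nrow(x))
  for(i in chunk(x)) b[i] <- eval(substitute(expr), x[i,])
  b
}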
bw <- ffwhich(a.fdf, YEAR==1999)
answer <- table(a.fdf[bw, "AGE"])

The table() operation is fast but building the bit vector is quite slow. Anyone have any recommendations for doing this better?

Mcgrew answered 3/12, 2010 at 20:32 Comment(0)
1

The package ffbase provides many base functions for ff/ffdf objects, including subset.ff. In limited testing, subset.ff seems relatively fast. Try loading ffbase and then using the simpler code you suggested in an earlier comment: with(subset(a.fdf, YEAR==1999), table(AGE)).
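For what it's worth, here is a minimal sketch of the subset route, assuming ffbase is installed. The toy data frame stands in for a.fdf and is purely hypothetical. The subset stays on disk as an ffdf, and only its AGE column is pulled into RAM with `[]` before calling base R's table().

```r
library(ff)
library(ffbase)  # assumption: ffbase is installed; it supplies subset() for ffdf

# Hypothetical toy data standing in for the real a.fdf
set.seed(1)
n  <- 10000
df <- data.frame(YEAR = sample(1995:2005, n, replace = TRUE),
                 AGE  = sample(18:90, n, replace = TRUE))
a.fdf <- as.ffdf(df)

# Rows with YEAR == 1999; the result is still an ffdf
sub <- subset(a.fdf, YEAR == 1999)

# Read the (much smaller) AGE column into RAM and tabulate it with base R
answer <- base::table(sub$AGE[])
```

Whether this beats the bit-vector loop on a 1.5M x 80 ffdf is something to time on the real data.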

Grams answered 13/6, 2013 at 15:39 Comment(0)
0

Not familiar with manipulating ff objects, but the problem you describe sounds like a classic tapply() task:

answer <- tapply(a.fdf$YEAR[a.fdf$YEAR == 1999], a.fdf$AGE[a.fdf$YEAR == 1999], length)

I would assume something like that would move faster than the two-step solution you give above, but maybe I'm misunderstanding how ff data structures work?
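Since a.fdf$YEAR and a.fdf$AGE are ff vectors on disk, one way to make this kind of in-memory indexing work is to read just those two columns into RAM first with `[]`. At 1.5M rows an integer column is about 6 MB, so that should be cheap. A minimal sketch, with a hypothetical toy data frame standing in for a.fdf:

```r
library(ff)

# Hypothetical toy data standing in for the real a.fdf
set.seed(42)
n  <- 5000
df <- data.frame(YEAR = sample(1995:2005, n, replace = TRUE),
                 AGE  = sample(18:90, n, replace = TRUE))
a.fdf <- as.ffdf(df)

# Pull only the two needed columns into RAM as ordinary integer vectors
year <- a.fdf$YEAR[]
age  <- a.fdf$AGE[]

# From here on it is plain base R
answer <- table(age[year == 1999])
```

The other columns are never read, which is the main difference from evaluating the condition on x[i,].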

Florentinoflorenza answered 4/12, 2010 at 3:21 Comment(1)
If it weren't ff, I could do something much simpler, like with(subset(a.fdf, YEAR==1999), table(AGE)). ff is the part that makes it trickier. - Mcgrew
0

My approach would be something like this:

system.time({
  index <- as.ff(which(a.fdf[,'Location'] == 'exonic'))
  table(a.fdf[index,][,'Function'])
})
   user  system elapsed
  1.128   0.172   1.317

Seems to be significantly faster than:

system.time({
  bw <- ffwhich(a.fdf, Location=="exonic")
  table(a.fdf[bw,'Function'])
})
   user  system elapsed
 24.901   0.208  25.150

YMMV, as these are factors, not characters, and my ffdf is ~4.3M * 42.

identical(table(a.fdf[bw,'Function']), table(a.fdf[index,][,'Function']));
[1] TRUE
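
If a single column is too big to read in one go, the same idea can probably be applied chunk by chunk, reading only the columns involved rather than whole rows. A rough sketch on the question's YEAR/AGE example; the toy data and the age_levels range are assumptions for illustration, not anything from my data:

```r
library(ff)

# Hypothetical toy data standing in for the real a.fdf
set.seed(7)
n  <- 20000
df <- data.frame(YEAR = sample(1995:2005, n, replace = TRUE),
                 AGE  = sample(18:90, n, replace = TRUE))
a.fdf <- as.ffdf(df)

age_levels <- 18:90                      # assumed known range of AGE
counts <- integer(length(age_levels))    # running counts, one per age

for (i in chunk(a.fdf)) {
  # Read only the two needed columns for this chunk of rows
  keep <- a.fdf$YEAR[i] == 1999
  ages <- a.fdf$AGE[i][keep]
  # Add this chunk's counts to the running totals
  counts <- counts + tabulate(match(ages, age_levels),
                              nbins = length(age_levels))
}

answer <- setNames(counts, age_levels)
```

Ages with no matches show up as zero counts here, unlike table() on the raw vector, which simply omits them.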
Conventioner answered 14/8, 2013 at 18:41 Comment(0)
