The ddply and ave approaches are both fairly resource-intensive, I think. ave fails by running out of memory for my current problem (67,608 rows, with four columns defining the unique keys). tapply is a handy choice, but what I generally need to do is select all the whole rows with the something-est some-value for each unique key (usually defined by more than one column). The best solution I've found is to do a sort and then use the negation of duplicated to select only the first row for each unique key. For the simple example here:
a <- sample(1:10, 100, replace = TRUE)
b <- sample(1:100, 100, replace = TRUE)
f <- data.frame(a, b)
# sort by the key (a) ascending, then by the value (b) descending, so the largest b comes first within each a
sorted <- f[order(f$a, -f$b), ]
# keep only the first row (the one with the highest b) for each value of a
highs <- sorted[!duplicated(sorted$a), ]
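
As a quick sanity check (a minimal sketch, reusing f and highs from above), the b values in highs should agree with the per-group maxima that tapply computes:

# compare the selected b values against tapply's per-group maxima
stopifnot(all(highs$b == tapply(f$b, f$a, max)[as.character(highs$a)]))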
I think the performance gains over ave or ddply, at least, are substantial. It is slightly more complicated for multi-column keys, but order will handle a whole bunch of things to sort on and duplicated works on data frames, so it's possible to continue using this approach, as sketched below.
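
For the multi-column case, a minimal sketch along the same lines (the second key column g and its values are made up for illustration):

# two key columns (a, g); we want the row with the largest b for each (a, g) pair
a <- sample(1:10, 100, replace = TRUE)
g <- sample(letters[1:3], 100, replace = TRUE)
b <- sample(1:100, 100, replace = TRUE)
f <- data.frame(a, g, b)
# order() accepts several sort keys: both key columns, then b descending
sorted <- f[order(f$a, f$g, -f$b), ]
# duplicated() works on data frames, so pass just the key columns
highs <- sorted[!duplicated(sorted[, c("a", "g")]), ]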