select rows with largest value of variable within a group in r
Asked Answered
S

6

9
a.2<-sample(1:10,100,replace=T)
b.2<-sample(1:100,100,replace=T)
a.3<-data.frame(a.2,b.2)

r<-sapply(split(a.3,a.2),function(x) which.max(x$b.2))

a.3[r,]

returns the list index, not the index for the entire data.frame

Im trying to return the largest value of b.2 for each subgroup of a.2. How can I do this efficiently?

Scarletscarlett answered 12/5, 2010 at 19:35 Comment(1)
melt(a.3,id=c("a.2"))->h.2; cast(h.2,a.2~,max) DOes the trick in this example but the computer runs out of memory when I apply it to my original dataset. So didnt really help me much.Scarletscarlett
D
6
a.2<-sample(1:10,100,replace=T)
b.2<-sample(1:100,100,replace=T)
a.3<-data.frame(a.2,b.2)

The answer by Jonathan Chang gets you what you explicitly asked for, but I'm guessing that you want the actual row from the data frame.

sel <- ave(b.2, a.2, FUN = max) == b.2
a.3[sel,]
Danais answered 12/5, 2010 at 23:35 Comment(2)
That was much simpler I must admit.. However the logic behind the == b.2 is beyond me...Scarletscarlett
The ave generates a vector that just contains the max of b.2 for every a.2. Therefore, where it == b.2 that sets a truth value as long as the data frame has rows. You're using the logical vector to select rows in the data frame. If you want to see how it's working add the result of the ave command to your data frame and look at it, comparing to the b.2 column -- a.3$b.max <- ave(b.2, a.2, FUN = max) . Also, you could make the sel variable and look at it in context with -- a.3$sel <- a.3$b.2 == a.3$b.maxDanais
N
10

The ddply and ave approaches are both fairly resource-intensive, I think. ave fails by running out of memory for my current problem (67,608 rows, with four columns defining the unique keys). tapply is a handy choice, but what I generally need to do is select all the whole rows with the something-est some-value for each unique key (usually defined by more than one column). The best solution I've found is to do a sort and then use negation of duplicated to select only the first row for each unique key. For the simple example here:

a <- sample(1:10,100,replace=T)
b <- sample(1:100,100,replace=T)
f <- data.frame(a, b)

sorted <- f[order(f$a, -f$b),]
highs <- sorted[!duplicated(sorted$a),]

I think the performance gains over ave or ddply, at least, are substantial. It is slightly more complicated for multi-column keys, but order will handle a whole bunch of things to sort on and duplicated works on data frames, so it's possible to continue using this approach.

Needlewoman answered 7/9, 2012 at 18:46 Comment(1)
This was the easiest to use and works great on multiple columns -- all you need to do is to use cbind inside duplicated.Aquanaut
D
8
library(plyr)
ddply(a.3, "a.2", subset, b.2 == max(b.2))
Drank answered 13/5, 2010 at 12:54 Comment(2)
I tried using the ddply function but it is painfully slow. I didnt time it but it lasted a coffecup and a trip to the bathroom whilst the ave version used only .2s in my original dataset (210col*16000rows).Scarletscarlett
That'll be fixed in the next version. But you can't expect to get answers that will work with your data unless you supply a realistic example!Drank
D
6
a.2<-sample(1:10,100,replace=T)
b.2<-sample(1:100,100,replace=T)
a.3<-data.frame(a.2,b.2)

The answer by Jonathan Chang gets you what you explicitly asked for, but I'm guessing that you want the actual row from the data frame.

sel <- ave(b.2, a.2, FUN = max) == b.2
a.3[sel,]
Danais answered 12/5, 2010 at 23:35 Comment(2)
That was much simpler I must admit.. However the logic behind the == b.2 is beyond me...Scarletscarlett
The ave generates a vector that just contains the max of b.2 for every a.2. Therefore, where it == b.2 that sets a truth value as long as the data frame has rows. You're using the logical vector to select rows in the data frame. If you want to see how it's working add the result of the ave command to your data frame and look at it, comparing to the b.2 column -- a.3$b.max <- ave(b.2, a.2, FUN = max) . Also, you could make the sel variable and look at it in context with -- a.3$sel <- a.3$b.2 == a.3$b.maxDanais
S
1
a.2<-sample(1:10,100,replace=T)
b.2<-sample(1:100,100,replace=T)
a.3<-data.frame(a.2,b.2)
m<-split(a.3,a.2)
u<-function(x){
    a<-rownames(x)
    b<-which.max(x[,2])
    as.numeric(a[b])
    }
r<-sapply(m,FUN=function(x) u(x))

a.3[r,]

This does the trick, albeit somewhat cumbersome...But it allows me to grab the rows for the groupwise largest values. Any other ideas?

Scarletscarlett answered 12/5, 2010 at 22:6 Comment(0)
A
1
> a.2<-sample(1:10,100,replace=T)
> b.2<-sample(1:100,100,replace=T)
> tapply(b.2, a.2, max)
 1  2  3  4  5  6  7  8  9 10 
99 92 96 97 98 99 94 98 98 96 
Amigo answered 12/5, 2010 at 23:9 Comment(0)
G
0
a.2<-sample(1:10,100,replace=T)
b.2<-sample(1:100,100,replace=T)
a.3<-data.frame(a.2,b.2)

With aggregate, you can get the maximum for each group in one line:

aggregate(a.3, by = list(a.3$a.2), FUN = max)

This produces the following output:

   Group.1 a.2 b.2
1        1   1  96
2        2   2  82
...
8        8   8  85
9        9   9  93
10      10  10  97
Geochronology answered 4/5, 2017 at 14:26 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.