Subsetting a data frame with top-n rows for each group, and ordered by a variable

I would like to subset a data frame to the top n rows within each group, where the groups are defined by one variable and the rows in each group are sorted in descending order by another variable. An example makes this clearer:

    d1 <- data.frame(Gender = c("M", "M", "F", "F", "M", "M", "F", "F"),
                     Age = c(15, 38, 17, 35, 26, 24, 20, 26))

I would like to get 2 rows, which are sorted descending on Age, for each Gender. The desired output is:

Gender  Age  
F   35  
F   26  
M   38  
M   26

I looked for order, sort and other solutions here, but could not find an appropriate solution to this problem. I appreciate your help.

Pellegrini answered 20/5, 2011 at 17:38 Comment(1)

Do you only want the largest two ages for each gender? – Pilferage 20/5, 2011 at 17:47

One solution using ddply() from plyr

require(plyr)
ddply(d1, "Gender", function(x) head(x[order(x$Age, decreasing = TRUE) , ], 2))
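For readers without plyr, the same split, sort, and head idea can be sketched in base R. This is only an illustration, assuming the `d1` from the question; `top_n_by_group` is a hypothetical helper name, not something from plyr or the answer above.

```r
# Data from the question
d1 <- data.frame(Gender = c("M", "M", "F", "F", "M", "M", "F", "F"),
                 Age = c(15, 38, 17, 35, 26, 24, 20, 26))

# Hypothetical helper: top n rows per group, sorted descending by one column
top_n_by_group <- function(df, group, by, n = 2) {
  pieces <- split(df, df[[group]])                    # one data frame per group
  top <- lapply(pieces, function(x) {
    head(x[order(x[[by]], decreasing = TRUE), ], n)   # sort, then keep first n rows
  })
  out <- do.call(rbind, top)                          # stack the pieces back together
  rownames(out) <- NULL
  out
}

res <- top_n_by_group(d1, "Gender", "Age", n = 2)
print(res)
```

Because `head()` simply returns every row when `n` exceeds the group size, asking for more rows than a group contains returns the whole group rather than failing.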

Fraternity answered 20/5, 2011 at 18:5 Comment(3)

I didn't see your answer before posting mine! Much better. – Scallion 20/5, 2011 at 18:13

that worked beautifully! I can even modify the "n" value. Thanks. – Pellegrini 20/5, 2011 at 18:24

@brandon and it also works even if your n is more than the actual number of rows in a group. So if you have 6 females and 5 males, and you change n to 5, you will get top 5 rows for females and all for males. This is exactly what I wanted – Pellegrini 20/5, 2011 at 18:39

With data.table package

require(data.table)
dt1 <- data.table(d1)  # optional speed-up: setkey(dt1, Gender)
# head() avoids NA rows when a group has fewer than 2 rows
dt1[, .SD[head(order(Age, decreasing = TRUE), 2)], by = Gender]

Miracidium answered 20/5, 2011 at 18:34 Comment(1)

Instead of order(Age, decreasing=TRUE) you can write order(-Age). That way you can order by several columns, each in a different direction; e.g., order(-Age, +Height, -Weight). – Dangle 8/5, 2012 at 16:22
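The mixed-direction ordering mentioned in this comment can be tried on a plain data frame as well. The sketch below uses a made-up `people` data frame; the `Name` and `Height` columns are assumptions for illustration and are not part of the question's data.

```r
# Hypothetical data: Name and Height do not appear in the original question
people <- data.frame(Name   = c("a", "b", "c", "d"),
                     Age    = c(30, 30, 25, 40),
                     Height = c(180, 170, 175, 165))

# Age descending (note the unary minus), ties broken by Height ascending
sorted <- people[order(-people$Age, people$Height), ]
print(sorted)
```

The unary minus works here because both columns are numeric; it is not a general way to reverse the order of character columns.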

I'm sure there is a better answer, but here is one way:

require(plyr)
ddply(d1, c("Gender", "-Age"))[c(1:2, 5:6),-1]

If you have a larger data frame than the one you provided here and don't want to inspect visually which rows to select, just use this:

new.d1=ddply(d1, c("Gender", "-Age"))[,-1]
pos=match('M',new.d1$Gender) # pos will show the index of the first entry of M
new.d1[c(1:2,pos:(pos+1)),]

Scallion answered 20/5, 2011 at 18:8 Comment(2)

Thanks for your solution, Manoel, but I did not try it as chase's solution worked for me. – Pellegrini 20/5, 2011 at 18:25

@karlos, of course. His solution is better than mine. In fact, just yesterday he helped me with a question and he used plyr as well. Not surprisingly, he used 'ddply' better than I did. – Scallion 20/5, 2011 at 18:35

It is even easier than that if you just want to do the sorting:

d1 <- transform(d1[order(d1$Age, decreasing=TRUE), ], Gender=as.factor(Gender))

you can then call:

require(plyr)
d1 <- ddply(d1, .(Gender), head, n=2)

to subset the top two of each Gender subgroup.

Bethsaida answered 25/9, 2011 at 16:56 Comment(0)

I have a suggestion if you need, for example, the first 2 females and the first 3 males:

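library(plyr)
# sort all rows by Age, oldest first
m <- d1[order(d1$Age, decreasing = TRUE), ]
# split() orders groups alphabetically (F, M), so y = c(2, 3) means 2 F and 3 M;
# SIMPLIFY = FALSE keeps a list even when all groups ask for the same n
h <- mapply(function(x, y) head(x, y), split(m$Age, m$Gender), y = c(2, 3), SIMPLIFY = FALSE)
ldply(h, data.frame)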

You just need to change the names of the final dataframe.
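If whole rows with their original column names are preferred, a base-R variant of the same per-group idea could look like the sketch below. It assumes the `d1` from the question; `n_per_group` is a hypothetical named vector introduced only for this illustration.

```r
# Data from the question
d1 <- data.frame(Gender = c("M", "M", "F", "F", "M", "M", "F", "F"),
                 Age = c(15, 38, 17, 35, 26, 24, 20, 26))

m <- d1[order(d1$Age, decreasing = TRUE), ]   # oldest first
pieces <- split(m, m$Gender)                  # list of data frames: "F" and "M"

# Hypothetical per-group row counts, matched to the groups by name
n_per_group <- c(F = 2, M = 3)

# head(piece, n) for each group, then stack the results
top <- Map(head, pieces, n_per_group[names(pieces)])
res <- do.call(rbind, top)
print(res)
```

Matching the counts by name rather than by position avoids relying on the alphabetical order in which `split()` returns the groups.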

Overspend answered 5/1, 2017 at 19:28 Comment(0)

d1 = d1[order(d1$Gender, -d1$Age),]  
d1 = d1[ave(d1$Age, d1$Gender, FUN = seq_along) <= 2, ]

I had a similar problem and found this method very fast on a data.frame with 1.5 million records.
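To see why the two lines above work, it may help to look at the intermediate vector that `ave()` produces. The sketch below repeats the same steps on the question's `d1`; `rank_in_group` and `top2` are names introduced here for illustration.

```r
# Data from the question
d1 <- data.frame(Gender = c("M", "M", "F", "F", "M", "M", "F", "F"),
                 Age = c(15, 38, 17, 35, 26, 24, 20, 26))

# Step 1: sort by Gender, then by Age descending within each Gender
d1 <- d1[order(d1$Gender, -d1$Age), ]

# Step 2: number the rows within each Gender (1 = oldest in that group)
rank_in_group <- ave(d1$Age, d1$Gender, FUN = seq_along)
print(rank_in_group)

# Step 3: keep only the rows whose within-group position is at most 2
top2 <- d1[rank_in_group <= 2, ]
print(top2)
```

Since the data are already sorted within each group, the running row number from `seq_along` acts as a rank, and the filter is a single vectorised comparison.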

Rib answered 30/3, 2019 at 10:54 Comment(0)
