remove IDs that occur x times R
3

5

I have a df and would like to remove people who have fewer than X rows in it. E.g., in this toy example, I would like to retain only people who have >= 5 rows.

df
   names  fruit
4   john   kiwi
7   john  apple
9   john banana
13  john orange
14  john  apple
2   mary orange
5   mary  apple
8   mary orange
10  mary  apple
12  mary  apple
1    tom  apple
3    tom banana
6    tom  apple
11   tom   kiwi

example output

df
   names  fruit
4   john   kiwi
7   john  apple
9   john banana
13  john orange
14  john  apple
2   mary orange
5   mary  apple
8   mary orange
10  mary  apple
12  mary  apple

Thanks in advance!

Muntin answered 18/8, 2013 at 18:55 Comment(0)
6

You can use table like this:

df[df$names %in% names(table(df$names))[table(df$names) >= 5],]
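To make the one-liner easier to follow, here is a minimal sketch that rebuilds the toy df from the question and splits the expression into steps. The intermediate names counts, keep and result are only illustrative, and stringsAsFactors = FALSE is an assumption so that names stays a character column.

```r
# Rebuild the toy data from the question (row names copied from the printout)
df <- data.frame(
  names = rep(c("john", "mary", "tom"), times = c(5, 5, 4)),
  fruit = c("kiwi", "apple", "banana", "orange", "apple",
            "orange", "apple", "orange", "apple", "apple",
            "apple", "banana", "apple", "kiwi"),
  row.names = c(4, 7, 9, 13, 14, 2, 5, 8, 10, 12, 1, 3, 6, 11),
  stringsAsFactors = FALSE
)

# Step 1: count the rows per person
counts <- table(df$names)

# Step 2: names() extracts the labels of the table; keep those with >= 5 rows
keep <- names(counts)[counts >= 5]

# Step 3: keep only the rows whose person is in that set
result <- df[df$names %in% keep, ]
print(result)
```

Computing table(df$names) once and reusing it also avoids tabulating the column twice.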
Greenlaw answered 18/8, 2013 at 19:2 Comment(1)
Excellent stuff, thank you. Just to highlight that the "names" directly after %in% is the base R function names(), which here extracts the labels of the table; it does not refer to the column called "names" in my df. - Muntin
6

Here's a data.table solution using the in-built .N value, which is as described in the ?data.table help file: ‘.N’ is an integer, length 1, containing the number of rows in the group.

# create a similar reproducible example
library(data.table)
dat <- data.table(names=rep(letters[1:3],c(5,5,3)),var=1:13)

Remove the rows:

dat[, cnt:=.N, by=names][cnt >= 5]
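One thing worth knowing about this version: := works by reference, so the helper column cnt is added to dat itself and also shows up in the filtered result. A small sketch of that side effect and one way to clean it up (res is just an illustrative name):

```r
library(data.table)

dat <- data.table(names = rep(letters[1:3], c(5, 5, 3)), var = 1:13)

# Add the per-group row count, then filter on it
res <- dat[, cnt := .N, by = names][cnt >= 5]

# cnt is now a column of dat as well, because := modifies dat in place
has_cnt_before <- "cnt" %in% names(dat)

# Drop the helper column from both tables once it is no longer needed
res[, cnt := NULL]
dat[, cnt := NULL]

print(res)
```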

Though I feel like there must be a way to do this without assigning a new variable. ...And now there is thanks to @mnel in the comments:

dat[,if(.N>=5).SD,by=names]

This essentially returns a sub-data.table .SD for each value of the by group if the number of rows in the group .N is greater than or equal to 5. It is pretty much equivalent to the more traditional R subsetting syntax of:

dat[,.SD[.N >= 5],by=names]
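As a quick check, this sketch runs the if form on the example data and compares it with another variant that collects row numbers through .I and then subsets dat once. The .I variant is not from the answer above; it is just another data.table idiom, and kept, idx and kept_idx are illustrative names.

```r
library(data.table)

dat <- data.table(names = rep(letters[1:3], c(5, 5, 3)), var = 1:13)

# Return the whole group (.SD) only when it has at least 5 rows
kept <- dat[, if (.N >= 5) .SD, by = names]

# Variant: .I holds each group's row numbers in dat; the unnamed result column is V1
idx <- dat[, .I[.N >= 5], by = names]$V1
kept_idx <- dat[idx]

print(kept)
print(isTRUE(all.equal(kept, kept_idx)))
```

Neither form adds a column to dat.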
Antiphony answered 19/8, 2013 at 0:4 Comment(2)
dat[,if(.N>=5).SD,by=names] - Sussex
@Mnel makes a great point. The advantage of using the if statement in j is that it avoids the overhead of calling .SD when the clause evaluates to FALSE. - Gautea
0

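An alternative is to use ave() to compute the size of each person's group, either inside subset():

subset(df, ave(seq_along(names), names, FUN = length) >= 5)

Or with plain indexing:

df[ave(seq_along(df$names), df$names, FUN = length) >= 5, ]

The first argument is seq_along(names) rather than names itself so that the counts stay numeric. If the character column is passed instead, ave() returns the counts as strings, and the comparison with 5 is then done on text, which can give wrong results for counts of 10 or more.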
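To see what ave() produces here, this small sketch prints the per-row group sizes for the names from the question. It passes an integer vector (seq_along) as the first argument, an assumption made so that the result is numeric; n_per_row is an illustrative name.

```r
# Only the names column is needed to show the counting step
df <- data.frame(
  names = rep(c("john", "mary", "tom"), times = c(5, 5, 4)),
  stringsAsFactors = FALSE
)

# ave() applies length() within each group and returns one value per row
n_per_row <- ave(seq_along(df$names), df$names, FUN = length)
print(n_per_row)

# Rows belonging to groups with at least 5 rows
result <- df[n_per_row >= 5, , drop = FALSE]
print(unique(result$names))
```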
Chirlin answered 14/6, 2021 at 5:22 Comment(0)
