Create a variable capturing the most frequent occurence by group
Asked Answered
H

3

7

Define:

df1 <-data.frame(
id=c(rep(1,3),rep(2,3)),
v1=as.character(c("a","b","b",rep("c",3)))
)

s.t.

> df1
  id v1
1  1  a
2  1  b
3  1  b
4  2  c
5  2  c
6  2  c

I want to create a third variable freq that contains the most frequent observation in v1 by id s.t.

> df2
  id v1 freq
1  1  a    b
2  1  b    b
3  1  b    b
4  2  c    c
5  2  c    c
6  2  c    c
Hayes answered 28/6, 2011 at 21:38 Comment(3)
how are ties supposed to be handled within id group?Showiness
@Showiness In my case I am sure there are no ties.Hayes
Good question about ties, I'll make a note about how my solution handles that...Burgh
B
3

You can do this using ddply and a custom function to pick out the most frequent value:

myFun <- function(x){
    tbl <- table(x$v1)
    x$freq <- rep(names(tbl)[which.max(tbl)],nrow(x))
    x
}

ddply(df1,.(id),.fun=myFun)

Note that which.max will return the first occurrence of the maximum value, in the case of ties. See ??which.is.max in the nnet package for an option that breaks ties randomly.

Burgh answered 28/6, 2011 at 21:51 Comment(0)
R
3

Another way consists of using tidyverse functions:

  • grouping first, using group_by(), and counting the occurrence of the second variable using tally()
  • arranging by the number of occurrences with arrange()
  • summarizing and picking out the first row with summarize() and first()

Therefore:

df1 %>%
group_by(id, v1) %>%
tally() %>%
arrange(id, desc(n)) %>%
summarize(freq = first(v1))

This will give you just the mapping (which I find cleaner):

# A tibble: 2 x 2
     id   freq
  <dbl> <fctr>
1     1      b
2     2      c

You can then left_join your original data frame with that table.

Resent answered 14/11, 2017 at 11:39 Comment(1)
I like that approach because one can check for and identify ties after tally(). That might be possible with @joran's great function too but not so straight forward as here, at least for meProbationer
T
1
mode <- function(x) names(table(x))[ which.max(table(x)) ]
df1$freq <- ave(df1$v1, df1$id, FUN=mode)
> df1
  id v1 freq
1  1  a    b
2  1  b    b
3  1  b    b
4  2  c    c
5  2  c    c
6  2  c    c
Tad answered 28/6, 2011 at 22:4 Comment(3)
I think df2 is a typo, and when I run this I get NAs for id=2.Burgh
The typo is gone, but I still don't think this code works. When id=2, max(table(x)) returns 3, but table(x) has only 1 name, so your function mode returns NA.Burgh
It is accidentally giving the correct result, because of an accident of factors. df$id is a factor and the 3rd level is "c". Fixed.Tad

© 2022 - 2024 — McMap. All rights reserved.