obtaining 3 most common elements of groups, concatenating ties, and ignoring less common values
Asked Answered
L

3

6

I am trying to get the 3 most common numbers per group of a dataframe, using a function, but ignoring the less common values (per group), and allowing a unique number if present. Accepted answer will have the lowest system.time

#my current function
library(plyr)
get.3modes.andcounts<- function(origtable,groupby,columnname) {
  data <- ddply (origtable, groupby, .fun = function(xx){
    c(m1 = paste(names(sort(table(xx[,columnname]),decreasing=TRUE)[1])),
      m2 = paste(names(sort(table(xx[,columnname]),decreasing=TRUE)[2])),
      m3 = paste(names(sort(table(xx[,columnname]),decreasing=TRUE)[3])),
      counts=length2(xx[[columnname]], na.rm=TRUE) #http://www.cookbook-r.com/Manipulating_data/Summarizing_data/
    ) } ) 
  return(data)
}
length2 <- function (x, na.rm=FALSE) {
  if (na.rm) sum(!is.na(x))
  else       length(x)
}
# example df
col2<-c(4, 4, 4, 4, 5, 3, 3, 3, 2, 2, # group1 "5" is the less common
        2, 2, 2, 4, 4, 3, 3, 2, 2, 2, # group2 "3" and "4" are equally less common, and there is 2 more frequent
        4, 4, 4, 4, 4, 4, 4, 4, 4, 4, # group3 "4" is unique
        4, 4, 4, 4, 5, 5, 5, 5, 2, 2, # group4 "2" is the less common, other ties more frequent
        4, 4, 4, 4, 4, 5, 5, 5, 5, 5) # group5 "4" and "5" are equally common and no value is less common (similar to unique)
col1<-paste(c(rep("group1",10),rep("group2",10),rep("group3",10),rep("group4",10),rep("group5",10)), sep=", ")
df<-data.frame(col1=col1,col2=col2)

get.3modes.andcounts(df,"col1","col2")

#CURRENT result 
col1 m1 m2 m3 counts
1 group1  4  3  2     10 # ok
2 group2  2  3  4     10 # no, 3 and 4 are the less common
3 group3  4 NA NA     10 # ok
4 group4  4  5  2     10 # no, 2 is less common
5 group5  4  5 NA     10 # ok

# desired
col1 m1 m2 m3 counts
1 group1  4  3  2     10
2 group2  2 NA NA     10
3 group3  4 NA NA     10
4 group4  4  5 NA     10
5 group5  4  5 NA     10

EDIT: The real sample has several ties, and having more than 3 columns is undesired. More than 3 numbers (in 3 columns) are accepted only if ties present. That is why, I decided to ask for another type of output.
EDIT: group7. Only three most common wanted. Exception, ties that include the 3rd most common (like in other groups).

    # EXAMPLE 2
    # new proposal
col2<-c(4, 4, 4, 4, 5, 3, 3, 3, 2, 2, 6, 6, # group1 2 and 6 tied in the 3rd position, 5 less common
        2, 2, 2, 4, 4, 3, 3, 2, 2, 2, 6, 6, # group2 4, 3 and 6 tied in the less common, excluded.
        4, 4, 4, 7, 7, 7, 5, 5, 5, 4, 4, 6, # group3 4, 7 and 5 more common, 3 most common present, exclude everything else
        4, 4, 4, 4, 5, 5, 5, 5, 2, 2, 6, 6, # group4 2 and 6 less common, excluded (4 AND 5 tied)
        4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, # group5 6 less common, excluded, (4 and 5 tied)
        4, 4, 4, 3, 3, 3, 2, 2, 2, 1, 1, 1, # group6 all tied
      14,14,14,16,16,16,16,34,34,42,42,42,80,80,84,92, #group7 16, 14, 42 are the three most freq.
      20,52,40,40,40,20,20,60,60,50) #group 8 20,40 tied, 60 next.
col1<-paste(c(rep("group1",12),rep("group2",12),rep("group3",12),rep("group4",12),rep("group5",12),
              rep("group6",12),rep("group7",16),rep("group8", 10)), sep=", ") 
df<-data.frame(col1=col1,col2=col2)

#desired output
    col1 m1       m2   m3 counts
1 group1  4        3  2,6     12 # 2 and 6 tied in the 3rd position, 5 less common
2 group2  2       NA   NA     12 # 4, 3 and 6 tied in the less common, excluded.
3 group3  4      7,5   NA     12 # three most common numbers present, exclude everything else 
4 group4  4,5     NA   NA     12 # 2 and 6 less common excluded (4 AND 5 tied)
5 group5  4,5     NA   NA     12 # 6 less common, excluded, (4 and 5 tied)
6 group6  4,3,2,1 NA   NA     12 # all tied
7 group7  16    14,42  NA     16 # three most frequent present, discard others
8 group8  20,40   60   NA     10 # three most frequent present
Lambrecht answered 9/3, 2017 at 14:53 Comment(0)
L
3

You could change n > 0, and it will work. Your question asks for 3, but my answer will be more generic by accepting any positive integer.

Using base R:

myfun <- function( data, n = 3, col1, col2 )
{
  ## n: numeric: total number of most common elements per group
  stopifnot( n > 0 )

  a1 <- lapply( split( data, data[[col1]] ), function( x ) { # split data by col1
    # browser()
    val  <- factor( x[[col2]] )                     # factor of data values
    z1   <- tabulate( val )                         # frequency table of levels of val
    z2   <- sort( z1[ z1 > 0 ], decreasing = TRUE ) # sorted frequency table with >0
    lenx <- length( unique( z2 ) )                  # length of unique of z2

    if ( lenx == 1 ) {  # lenx == 1
      return( c( paste( ( levels(val)[ which( z1 %in% z2 ) ] ), collapse = ','), rep(NA_character_, n - 1 ), sum( z1 ) ) )
    } else if ( lenx > 1 ) { # lenx > 1
      # remove the minimum, and and extract values by using levels of val with indices from the match of z1 and z2
      z2 <- setdiff( z2, min( z2 ) )
      z2 <- sapply( z2, function( y ) paste( levels(val)[ which( z1 %in% y ) ], collapse = ',') )      

      # count the length of z2 and get indices of length >= n
      z2_ind <- which( cumsum( lengths(unlist( lapply(z2, strsplit, split = "," ), 
                                               recursive = F ) ) ) >= n )
      if( length( z2_ind ) > 0 ) {
        z2 <- z2[ seq_len( z2_ind[1] ) ]
      }
      # adjust length by assigning NA
      if( length(z2) != n ) { z2[ (length(z2)+1):n ] <- NA_character_ }

      return( c( z2, sum( z1 ) ) )
    } else { # lenx < 1
      return( as.list( rep(NA_character_, n ), NA_character_ ) )
    }})  

  a1 <- do.call('rbind', a1)  # row bind values of a1
  a1 <- data.frame( group = rownames( a1 ), a1, stringsAsFactors = FALSE )
  colnames( a1 ) <- c( 'group', paste( 'm', 1:n, sep = '' ), 'count' )
  rownames( a1 ) <- NULL   # remove row names
  return( a1 )
}

Output:

# example1:
myfun(df, 3, 'col1', 'col2')
#    group   m1 m2 m3 count
# 1 group1    4  3  2    10
# 2 group2    2 NA NA    10
# 3 group3    4 NA NA    10
# 4 group4 4, 5 NA NA    10
# 5 group5 4, 5 NA NA    10

# example 2
myfun(df3, 3, 'col1', 'col2')
#    group         m1     m2   m3 count
# 1 group1          4      3 2, 6    12
# 2 group2          2     NA   NA    12
# 3 group3          4   5, 7   NA    12
# 4 group4       4, 5     NA   NA    12
# 5 group5       4, 5     NA   NA    12
# 6 group6 4, 3, 2, 1     NA   NA    12
# 7 group7         16 14, 42   NA    16

Create character data instead of numeric data by assigning letters to column 3 of example 1 data df.

set.seed(1L)
df$col3 <- sample( letters, 50, TRUE )
myfun(df, 3, 'col1', 'col3')
#    group                  m1   m2   m3 count
# 1 group1                   x <NA> <NA>    10
# 2 group2                 j,u <NA> <NA>    10
# 3 group3 a,d,f,g,i,j,k,q,w,y <NA> <NA>    10
# 4 group4                   m <NA> <NA>    10
# 5 group5                   u <NA> <NA>    10
Ludewig answered 12/3, 2017 at 8:24 Comment(0)
P
6

Using dplyr and tidyr (an updated version of plyr):

library(dplyr)
library(tidyr)

df %>% 
  group_by(col1, col2) %>% 
  summarise(n = n()) %>% 
  mutate(m = min_rank(desc(n)), count = sum(n)) %>% 
  filter(m <= 3 & (m != max(m) | m == 1)) %>% 
  group_by(col1, m, count) %>% 
  summarize(a = paste(col2, collapse = ',')) %>% 
  spread(m, a, sep = '') %>% 
  ungroup
# # A tibble: 7 × 5
#     col1 count      m1    m2    m3
# * <fctr> <int>   <chr> <chr> <chr>
# 1 group1    12       4     3   2,6
# 2 group2    12       2  <NA>  <NA>
# 3 group3    12       4   5,7  <NA>
# 4 group4    12     4,5  <NA>  <NA>
# 5 group5    12     4,5  <NA>  <NA>
# 6 group6    12 1,2,3,4  <NA>  <NA>
# 7 group7    16      16 14,42  <NA>

If you need it inside a function:

get.3modes.andcounts <- function(origtable, groupby, columnname) {
  origtable %>% 
    group_by_(groupby, columnname) %>% 
    summarise(n = n()) %>% 
    mutate(r = min_rank(desc(n)), count = sum(n)) %>% 
    filter(r <= 3 & (r != max(r) | r == 1)) %>% 
    group_by_(groupby, 'r', 'count') %>% 
    summarize_(a = paste0('paste(',columnname, ', collapse = ",")')) %>% 
    spread(r, a, sep = '') %>% 
    ungroup
}

get.3modes.andcounts(df, 'col1', 'col2')
# # A tibble: 7 × 5
#     col1 count      m1    m2    m3
# * <fctr> <int>   <chr> <chr> <chr>
# 1 group1    12       4     3   2,6
# 2 group2    12       2  <NA>  <NA>
# 3 group3    12       4   5,7  <NA>
# 4 group4    12     4,5  <NA>  <NA>
# 5 group5    12     4,5  <NA>  <NA>
# 6 group6    12 1,2,3,4  <NA>  <NA>
# 7 group7    16      16 14,42  <NA>

System.time

system.time(get.3modes.andcounts(df, 'col1', 'col2'))
#    user  system elapsed 
#   0.012   0.000   0.011 
benchmark(get.3modes.andcounts(df, 'col1', 'col2'), replications = 10, columns = c("test", "replications", "elapsed"))
#                                       test replications elapsed
# 1 get.3modes.andcounts(df, "col1", "col2")           10    0.08
benchmark(get.3modes.andcounts(df, 'col1', 'col2'), replications = 100, columns = c("test", "replications", "elapsed"))
#                                       test replications elapsed
# 1 get.3modes.andcounts(df, "col1", "col2")          100   0.684
benchmark(get.3modes.andcounts(df, 'col1', 'col2'), replications = 1000, columns = c("test", "replications", "elapsed"))
#                                       test replications elapsed
# 1 get.3modes.andcounts(df, "col1", "col2")         1000   6.796

Data:

col2 <- c(4, 4, 4, 4, 5, 3, 3, 3, 2, 2, 6, 6, # group1 2 and 6 tied in the 3rd position, 5 less common
        2, 2, 2, 4, 4, 3, 3, 2, 2, 2, 6, 6, # group2 4, 3 and 6 tied in the less common, excluded.
        4, 4, 4, 7, 7, 7, 5, 5, 5, 4, 4, 6, # group3 4, 7 and 5 more common, 3 most common present, exclude everything else
        4, 4, 4, 4, 5, 5, 5, 5, 2, 2, 6, 6, # group4 2 and 6 less common, excluded (4 AND 5 tied)
        4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, # group5 6 less common, excluded, (4 and 5 tied)
        4, 4, 4, 3, 3, 3, 2, 2, 2, 1, 1, 1, # group6 all tied
        14,14,14,16,16,16,16,34,34,42,42,42,80,80,84,92) #group7 16, 14, 42 are the three most freq.
col1 <- paste(c(rep("group1", 12), rep("group2", 12), rep("group3", 12), rep("group4", 12), rep("group5", 12),
              rep("group6", 12), rep("group7", 16)), sep = ", ") 
df <- data.frame(col1=col1,col2=col2)
Phoebe answered 9/3, 2017 at 15:45 Comment(3)
Updated to reflect changes in the question.Phoebe
You can also use the 'library(tidyverse)' command which loads both 'dplyr' and 'tidyr'Instigation
not working for group 8, unwanted NA, between 40 and 60Lambrecht
L
3

You could change n > 0, and it will work. Your question asks for 3, but my answer will be more generic by accepting any positive integer.

Using base R:

myfun <- function( data, n = 3, col1, col2 )
{
  ## n: numeric: total number of most common elements per group
  stopifnot( n > 0 )

  a1 <- lapply( split( data, data[[col1]] ), function( x ) { # split data by col1
    # browser()
    val  <- factor( x[[col2]] )                     # factor of data values
    z1   <- tabulate( val )                         # frequency table of levels of val
    z2   <- sort( z1[ z1 > 0 ], decreasing = TRUE ) # sorted frequency table with >0
    lenx <- length( unique( z2 ) )                  # length of unique of z2

    if ( lenx == 1 ) {  # lenx == 1
      return( c( paste( ( levels(val)[ which( z1 %in% z2 ) ] ), collapse = ','), rep(NA_character_, n - 1 ), sum( z1 ) ) )
    } else if ( lenx > 1 ) { # lenx > 1
      # remove the minimum, and and extract values by using levels of val with indices from the match of z1 and z2
      z2 <- setdiff( z2, min( z2 ) )
      z2 <- sapply( z2, function( y ) paste( levels(val)[ which( z1 %in% y ) ], collapse = ',') )      

      # count the length of z2 and get indices of length >= n
      z2_ind <- which( cumsum( lengths(unlist( lapply(z2, strsplit, split = "," ), 
                                               recursive = F ) ) ) >= n )
      if( length( z2_ind ) > 0 ) {
        z2 <- z2[ seq_len( z2_ind[1] ) ]
      }
      # adjust length by assigning NA
      if( length(z2) != n ) { z2[ (length(z2)+1):n ] <- NA_character_ }

      return( c( z2, sum( z1 ) ) )
    } else { # lenx < 1
      return( as.list( rep(NA_character_, n ), NA_character_ ) )
    }})  

  a1 <- do.call('rbind', a1)  # row bind values of a1
  a1 <- data.frame( group = rownames( a1 ), a1, stringsAsFactors = FALSE )
  colnames( a1 ) <- c( 'group', paste( 'm', 1:n, sep = '' ), 'count' )
  rownames( a1 ) <- NULL   # remove row names
  return( a1 )
}

Output:

# example1:
myfun(df, 3, 'col1', 'col2')
#    group   m1 m2 m3 count
# 1 group1    4  3  2    10
# 2 group2    2 NA NA    10
# 3 group3    4 NA NA    10
# 4 group4 4, 5 NA NA    10
# 5 group5 4, 5 NA NA    10

# example 2
myfun(df3, 3, 'col1', 'col2')
#    group         m1     m2   m3 count
# 1 group1          4      3 2, 6    12
# 2 group2          2     NA   NA    12
# 3 group3          4   5, 7   NA    12
# 4 group4       4, 5     NA   NA    12
# 5 group5       4, 5     NA   NA    12
# 6 group6 4, 3, 2, 1     NA   NA    12
# 7 group7         16 14, 42   NA    16

Create character data instead of numeric data by assigning letters to column 3 of example 1 data df.

set.seed(1L)
df$col3 <- sample( letters, 50, TRUE )
myfun(df, 3, 'col1', 'col3')
#    group                  m1   m2   m3 count
# 1 group1                   x <NA> <NA>    10
# 2 group2                 j,u <NA> <NA>    10
# 3 group3 a,d,f,g,i,j,k,q,w,y <NA> <NA>    10
# 4 group4                   m <NA> <NA>    10
# 5 group5                   u <NA> <NA>    10
Ludewig answered 12/3, 2017 at 8:24 Comment(0)
M
3

No extra package needed. Try this:

count <- function(df) {
  count_n <- function(vec, n) {
    fac <- factor(table(vec), levels = sort(unique(table(vec)), decreasing = T))
    top3 <- na.omit(names(sort(fac)[1:3]))
    min <- names(fac[fac == min(levels(fac))])
    if(length(levels(fac))==1){min <- 'NA'}
    top3 <- setdiff(top3,min)
    nums <- na.omit(names(fac[fac == levels(fac)[n]]))
    ifelse(length(intersect(nums, top3))>0,  paste(nums, collapse = ','),'NA')
  } ##Get the number of rank n. 
  group <- unique(as.character(df$col1))
  m1 <- aggregate(df, list(df$col1), count_n, 1)$col2
  m2 <- aggregate(df, list(df$col1), count_n, 2)$col2
  m3 <- aggregate(df, list(df$col1), count_n, 3)$col2
  count <- aggregate(df, list(df$col1), length)$col2

  res <- data.frame(col1 = group, m1, m2, m3, count)
  res
}
Meneau answered 12/3, 2017 at 12:45 Comment(7)
I am getting two different NAs: <NA> and NA, when all the data in m3 is NA, it is show as NA, when not, <NA>Lambrecht
Oh, I get it. You want 'NA' , not the real NA. I have modified my code, It should be OK now. @LambrechtMeneau
I didn't get two kinds of NAs in my hand. NA values just appear as <NA> when you print a data.frame. You can replace them with df[is.na(df)] <- valueMeneau
Not exactly. excuse me, I think it was only a matter of printing, it was ok. sometimes appears as <NA> and in "empty" column NA, but both are correct NAs.Lambrecht
got your point at last. You only want the top 3 common numbers. I've made a update to my answer. It is more compact and works fine now. @LambrechtMeneau
Works fine now. @LambrechtMeneau
Hi, I would like to pass the column names as arguments of the function. And the resulting columns not as factors, as characters or num.Lambrecht

© 2022 - 2024 — McMap. All rights reserved.