Find all records which have multiple values in a column in R

Asked 21/6, 2018 at 8:21 Answered 21/6, 2018 at 8:37

For a sample dataframe:

df <- structure(list(code = c("a1", "a1", "b2", "v4", "f5", "f5", "h7", 
       "a1"), name = c("katie", "katie", "sally", "tom", "amy", "amy", 
       "ash", "james"), number = c(3.5, 3.5, 2, 6, 4, 4, 7, 3)), .Names = c("code", 
       "name", "number"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
       -8L), spec = structure(list(cols = structure(list(code = structure(list(), class = c("collector_character", 
       "collector")), name = structure(list(), class = c("collector_character", 
       "collector")), number = structure(list(), class = c("collector_double", 
       "collector"))), .Names = c("code", "name", "number")), default = structure(list(), class = c("collector_guess", 
       "collector"))), .Names = c("cols", "default"), class = "col_spec"))

I want to highlight all the records which are have two or more values of 'code' which are the same. I know I could use:

df[duplicated(df$name), ]

But this only highlights the duplicated records, but I want all of the code values which are duplicated (i.e. 3 a1s and 2 f5s).

Any ideas?

Rompers answered 21/6, 2018 at 8:21 Comment(0)

df[duplicated(df$code) | duplicated(df$code, fromLast=TRUE), ]
  code  name number
1   a1 katie    3.5
2   a1 katie    3.5
5   f5   amy    4.0
6   f5   amy    4.0
8   a1 james    3.0

Another solution inspired by Alok VS:

ta <- table(df$code)
df[df$code %in% names(ta)[ta > 1], ]

Edit: If you are ok with leaving base R then gdata::duplicated2() allows for more concision.

library(gdata)
df[duplicated2(df$code), ]

Subtotal answered 21/6, 2018 at 8:35 Comment(4)

Perfect - Thanks @snoram! – Rompers 21/6, 2018 at 8:36

what about a case with 4 duplicates? – Forefinger 21/6, 2018 at 8:38

Should work: try on df <- rbind(df, c(code = "a1", name = "j", number = 9)) – Subtotal 21/6, 2018 at 8:39

duplicated(df$code) skips the first one and duplicated(df$code, fromLast=TRUE) skips the last one. – Subtotal 21/6, 2018 at 8:40

turn the indexes to values - and then check if 'code' fits this values:

 df[df$code %in% df$code[duplicated(df$code)], ]
  code  name number
1   a1 katie    3.5
2   a1 katie    3.5
5   f5   amy    4.0
6   f5   amy    4.0
8   a1 james    3.0

Forefinger answered 21/6, 2018 at 8:37 Comment(0)

I've come up with a crude solution,

temp<-aggregate(df$code, by=list(df$code), FUN=length)
temp<-temp[temp$x>1,]

df[df$code %in% temp$Group.1,]

Hounding answered 21/6, 2018 at 8:37 Comment(0)

Recommended topics

Hot tags