Finding ALL duplicate rows, including "elements with smaller subscripts"

R's duplicated returns a logical vector showing whether each element of a vector (or each row of a data frame) is a duplicate of an element with a smaller subscript. So if rows 3, 4, and 5 of a 5-row data frame are the same, duplicated will give me the vector

FALSE, FALSE, FALSE, TRUE, TRUE

But in this case I actually want to get

FALSE, FALSE, TRUE, TRUE, TRUE

that is, I want to know whether a row is duplicated by a row with a larger subscript too.
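For concreteness, a minimal sketch of that setup (this toy data frame is illustrative, not from the original question):

df <- data.frame(x = c("a", "b", "c", "c", "c"))
duplicated(df)   # flags only the later copies
## [1] FALSE FALSE FALSE  TRUE  TRUE
# desired output: the first copy flagged as well
## [1] FALSE FALSE  TRUE  TRUE  TRUE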

Hamel answered 21/10, 2011 at 19:37 Comment(0)

duplicated has a fromLast argument. The "Example" section of ?duplicated shows you how to use it. Just call duplicated twice, once with fromLast=FALSE and once with fromLast=TRUE, and take the rows where either is TRUE.


A late edit: you didn't provide a reproducible example, so here's an illustration kindly contributed by @jbaums:

vec <- c("a", "b", "c", "c", "c")
vec[duplicated(vec) | duplicated(vec, fromLast=TRUE)]
## [1] "c" "c" "c"

Edit: And an example for the case of a data frame:

df <- data.frame(rbind(c("a","a"),c("b","b"),c("c","c"),c("c","c")))
df[duplicated(df) | duplicated(df, fromLast=TRUE), ]
##   X1 X2
## 3  c  c
## 4  c  c
Oscillograph answered 21/10, 2011 at 19:56 Comment(2)
Hold on, I just ran a test and found I was wrong: x <- c(1:9, 7:10, 5:22); y <- c(letters, letters[1:5]); test <- data.frame(x, y); test[duplicated(test$x) | duplicated(test$x, fromLast=TRUE), ] returned all three of the copies of 7, 8, and 9. Why does that work? – Rosemare
Because the middle ones are captured no matter whether you start from the end or from the front. For example, duplicated(c(1,1,1)) vs duplicated(c(1,1,1), fromLast = TRUE) gives c(FALSE,TRUE,TRUE) and c(TRUE,TRUE,FALSE). The middle value is TRUE in both cases. Taking | of both vectors gives c(TRUE,TRUE,TRUE). – Bott

You need to assemble the set of duplicated values, apply unique, and then test with %in%. As always, a sample problem will make this process come alive.

> vec <- c("a", "b", "c","c","c")
> vec[ duplicated(vec)]
[1] "c" "c"
> unique(vec[ duplicated(vec)])
[1] "c"
>  vec %in% unique(vec[ duplicated(vec)]) 
[1] FALSE FALSE  TRUE  TRUE  TRUE
Baluchi answered 21/10, 2011 at 19:49 Comment(2)
Agreed. It might even slow down processing, but it is unlikely to slow it down very much. – Baluchi
Quite true. The OP did not offer a data example to test for "ever duplicated" rows in a dataframe. I think my suggestion of using duplicated, unique and %in% could easily be generalized to a dataframe if one were to first paste each row with an unusual separator character, as sketched below. (The accepted answer is better.) – Baluchi
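A minimal sketch of that paste-based generalization (the data frame and the "\r" separator here are illustrative, not from the original answer):

df <- data.frame(X1 = c("a", "b", "c", "c"), X2 = c("a", "b", "c", "c"))
key <- do.call(paste, c(df, sep = "\r"))   # collapse each row into one string
df[key %in% unique(key[duplicated(key)]), ]
##   X1 X2
## 3  c  c
## 4  c  c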

Duplicated rows in a dataframe can be obtained with dplyr by doing

library(tidyverse)
df <- bind_rows(iris, head(iris, 20)) # build some test data
df %>% group_by_all() %>% filter(n() > 1) %>% ungroup()

To exclude certain columns, group_by_at(vars(-var1, -var2)) can be used instead to group the data; see the sketch below.
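For instance, with the test data built above, grouping on everything except Species flags rows that agree on the four measurement columns (a sketch, not part of the original answer):

df %>% group_by_at(vars(-Species)) %>% filter(n() > 1) %>% ungroup()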

If the row indices, and not just the data, are actually needed, you can add them first, as in:

df %>% add_rownames() %>% group_by_at(vars(-rowname)) %>% filter(n() > 1) %>% pull(rowname)
Rennin answered 17/6, 2019 at 13:47 Comment(5)
Nice use of n(). Don't forget to ungroup the resulting dataframe. – Mouthpart
@Mouthpart I've adjusted the answer to ungroup the result. – Rennin
@HolgerBrandl, @qwr, the general answer is useful, but I don't understand how to pick the column(s) to exclude. What does the vars in group_by_at(vars(-var1, -var2)) refer to? Are var1 and var2 column names in a datatable named vars? I assume the negative signs signify exclusion, right? So the rest of the process (filter and ungroup) acts on the rest of the columns in that datatable vars, but not including var1 and var2, is that right? Sorry to be so pedantic, but I often have problems with quick shorthand! – Houseraising
vars is a method in dplyr, see dplyr.tidyverse.org/reference/vars.html . var1 and var2 indeed refer to column names to be excluded from the duplication check. Duplication is assessed on the grouping variables in the suggested solution. And indeed, the negative sign signifies exclusion. – Rennin
group_by_all() and group_by_at() have been superseded in recent versions of dplyr. Now you can do this: iris %>% group_by(across()) %>% filter(n() > 1) %>% ungroup() – Drue

Here is @Joshua Ulrich's solution as a function. This format allows you to use this code in the same fashion that you would use duplicated():

allDuplicated <- function(vec){
  # flag elements duplicated by something earlier...
  front <- duplicated(vec)
  # ...and by something later
  back <- duplicated(vec, fromLast = TRUE)
  # TRUE wherever either direction flagged the element
  all_dup <- front + back > 0
  return(all_dup)
}

Using the same example:

vec <- c("a", "b", "c","c","c") 
allDuplicated(vec) 
[1] FALSE FALSE  TRUE  TRUE  TRUE
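Since duplicated() also works row-wise on data frames, the same function applies there unchanged; for example, with the data frame from the accepted answer:

df <- data.frame(rbind(c("a","a"), c("b","b"), c("c","c"), c("c","c")))
df[allDuplicated(df), ]
##   X1 X2
## 3  c  c
## 4  c  c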

Nedry answered 23/4, 2020 at 15:26 Comment(0)

This is exactly what vctrs::vec_duplicate_detect() does:

# on a vector
vctrs::vec_duplicate_detect(c(1, 2, 1))
#> [1]  TRUE FALSE  TRUE
# on a data frame
vctrs::vec_duplicate_detect(mtcars[c(1, 2, 1),])
#> [1]  TRUE FALSE  TRUE

Created on 2022-07-19 by the reprex package (v2.0.1)
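The resulting logical vector can be used directly for subsetting; a small sketch (not from the original answer):

# keep every row that has at least one duplicate
df <- mtcars[c(1, 2, 1), ]
df[vctrs::vec_duplicate_detect(df), ]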

Koss answered 19/7, 2022 at 23:37 Comment(0)

I've had the same question, and if I'm not mistaken, this is also an answer.

vec[vec$col %in% vec[duplicated(vec$col), ]$col, ]

I don't know which one is faster, though; the dataset I'm currently using isn't big enough to run tests that produce significant time gaps.
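For what it's worth, a toy dataframe on which the line above can be checked (the names vec and col are placeholders):

vec <- data.frame(col = c("a", "b", "c", "c", "c"), val = 1:5)
vec[vec$col %in% vec[duplicated(vec$col), ]$col, ]
##   col val
## 3   c   3
## 4   c   4
## 5   c   5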

Jocular answered 1/6, 2016 at 14:26 Comment(1)
Note that this only works if vec is a dataframe with a column col; with an atomic vector it would fail. – Baluchi

I had a similar problem but I needed to identify duplicated rows by values in specific columns. I came up with the following dplyr solution:

df <- df %>% 
  group_by(Column1, Column2, Column3) %>% 
  mutate(Duplicated = case_when(length(Column1) > 1 ~ "Yes",
                                TRUE ~ "No")) %>%
  ungroup()

The code groups the rows by specific columns. If the size of a group is greater than 1, the code marks all of the rows in the group as duplicated. Once that is done, you can use the Duplicated column for filtering and so on, as shown below.
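For example, to keep only the flagged rows afterwards (column name as in the sketch above):

df %>% filter(Duplicated == "Yes")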

Lacefield answered 15/5, 2020 at 19:52 Comment(0)

If you are interested in which rows are duplicated for certain columns you can use a plyr approach:

library(plyr)
ddply(df, .(col1, col2), function(df) if (nrow(df) > 1) df else c())

Adding a count variable with dplyr:

df %>% add_count(col1, col2) %>% filter(n > 1)  # data frame
df %>% add_count(col1, col2) %>% pull(n) > 1    # logical vector

For duplicate rows (considering all columns):

df %>% group_by_all() %>% add_tally() %>% ungroup() %>% filter(n > 1)
df %>% group_by_all() %>% add_tally() %>% ungroup() %>% pull(n) > 1

The benefit of these approaches is that you can specify how many duplicates to treat as a cutoff.
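For instance, requiring at least three occurrences rather than two (col1 and col2 are the same placeholder columns as above):

df %>% add_count(col1, col2) %>% filter(n > 2)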

Mouthpart answered 6/6, 2019 at 21:14 Comment(0)

This updates @Holger Brandl's answer to reflect recent versions of dplyr (e.g. 1.0.5), in which group_by_all() and group_by_at() have been superseded. The help doc suggests using across() instead.

Thus, to get all rows for which there is a duplicate you can do this: iris %>% group_by(across()) %>% filter(n() > 1) %>% ungroup()

To include the indices of such rows, add a 'rowid' column but exclude it from the grouping: iris %>% rowid_to_column() %>% group_by(across(!rowid)) %>% filter(n() > 1) %>% ungroup()

Append %>% pull(rowid) after the above and you'll get a vector of the indices.
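Put together (a sketch of the steps described above; rowid_to_column() comes from the tibble package, loaded with the tidyverse):

iris %>% rowid_to_column() %>% group_by(across(!rowid)) %>% filter(n() > 1) %>% ungroup() %>% pull(rowid)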

Drue answered 29/9, 2021 at 21:24 Comment(0)

It took me a while to figure out how to create a new column that lists TRUE for any row where the value of id is duplicated:

data %>% mutate(duplicate_id = if_else(id %in% id[duplicated(id)], TRUE, FALSE))

Prouty answered 7/9, 2023 at 14:58 Comment(0)
