How can I remove all duplicates so that NONE are left in a data frame?

Asked 7/12, 2012 at 12:35 Answered 10/6, 2024 at 14:0

There is a similar question for PHP, but I'm working with R and am unable to translate the solution to my problem.

I have a data frame with 10 rows and 50 columns, where some of the rows are absolutely identical. If I use unique on it, I get one row per "type" (so to speak), but what I actually want is only those rows that appear exactly once. Does anyone know how I can achieve this?

I can have a look at clusters and heatmaps to sort it out manually, but I have bigger data frames than the one mentioned above (with up to 100 rows) where this gets a bit tricky.

Klehm answered 7/12, 2012 at 12:35

This will extract the rows which appear only once (assuming your data frame is named df):

df[!(duplicated(df) | duplicated(df, fromLast = TRUE)), ]

How it works: The function duplicated scans the rows from the first to the last and flags every row that has already appeared earlier, so the first occurrence of a repeated row is not flagged. If the argument fromLast = TRUE is used, the scan runs from the last row to the first, so the last occurrence is the one left unflagged.

Both logical results are combined with | (logical 'or') into a new vector that marks every row appearing more than once. Negating this vector with ! gives a logical vector marking the rows that appear exactly once.
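To make the intermediate steps visible, here is a minimal sketch on a small made-up data frame (the data and the variable names dup_forward, dup_backward, and singletons are only illustrative, not from the question). The drop = FALSE argument is an addition so that the result stays a data frame even if df has a single column.

```r
# Hypothetical example data: rows 1, 2 and 5 are identical
df <- data.frame(a = c(1, 1, 2, 3, 1),
                 b = c("x", "x", "y", "z", "x"))

# Flags rows already seen when scanning top to bottom
dup_forward <- duplicated(df)                    # FALSE TRUE FALSE FALSE TRUE

# Flags rows already seen when scanning bottom to top
dup_backward <- duplicated(df, fromLast = TRUE)  # TRUE TRUE FALSE FALSE FALSE

# A row appearing more than once is flagged by at least one of the two scans
appears_more_than_once <- dup_forward | dup_backward

# Keep only the rows that were never flagged (here: rows 3 and 4)
singletons <- df[!appears_more_than_once, , drop = FALSE]
print(singletons)
```

Neither scan alone is enough: each one leaves a single occurrence of every repeated row unflagged, which is why the two are combined.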

Ilianailine answered 7/12, 2012 at 12:40

A possibility involving dplyr could be:

df %>%
 group_by_all() %>%
 filter(n() == 1)

Or:

df %>%
 group_by_all() %>%
 filter(!any(row_number() > 1))

Since dplyr 1.0.0, the preferable way would be:

df %>%
    group_by(across(everything())) %>%
    filter(n() == 1)
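For a self-contained run of the dplyr 1.0.0 form, a sketch along these lines should work, assuming dplyr is installed. The example data frame is made up for illustration, and the trailing ungroup() is an addition so that the result is not left grouped.

```r
library(dplyr)

# Hypothetical example data: rows 1 and 2 are identical
df <- data.frame(a = c(1, 1, 2, 3),
                 b = c(1, 1, 2, 3),
                 d = c(1, 1, 2, 4))

result <- df %>%
  group_by(across(everything())) %>%  # one group per distinct row
  filter(n() == 1) %>%                # keep groups containing exactly one row
  ungroup()

print(result)
```

Grouping by every column makes each group one distinct row pattern, so n() == 1 selects exactly the rows that occur once.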

Wonderment answered 21/7, 2019 at 21:20

An approach using vctrs::vec_duplicate_detect

Original example

library(vctrs)

vec <- c(1, 2, 2, 3, 4, 3, 2)

vec[!vec_duplicate_detect(vec)]
[1] 1 4

On a data.frame

df
  a b d
1 1 1 1
2 1 1 1
3 2 2 2
4 3 3 4

df[!vec_duplicate_detect(df),]
  a b d
3 2 2 2
4 3 3 4
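As a quick sanity check, the following sketch rebuilds the small data frame shown above and compares the vctrs result with the base R duplicated approach. It assumes vctrs is installed; the names base_result and vctrs_result are only illustrative.

```r
library(vctrs)

# Same values as the data frame printed above
df <- data.frame(a = c(1, 1, 2, 3),
                 b = c(1, 1, 2, 3),
                 d = c(1, 1, 2, 4))

# Base R: drop every row that is flagged by either scan direction
base_result <- df[!(duplicated(df) | duplicated(df, fromLast = TRUE)), ]

# vctrs: vec_duplicate_detect() flags all occurrences of repeated rows at once
vctrs_result <- df[!vec_duplicate_detect(df), ]

# Both approaches should select the same rows
print(identical(base_result, vctrs_result))
```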

Benchmark

length(vec)
[1] 175120

library(microbenchmark)

microbenchmark(
  base = {vec[!(duplicated(vec) | duplicated(vec, fromLast = TRUE))]},
  vctrs = {vec[!vec_duplicate_detect(vec)]})
Unit: milliseconds
  expr       min        lq     mean   median       uq      max neval
  base 12.241369 14.408094 16.70000 16.94082 17.26830 26.69546   100
 vctrs  7.526593  9.701161 11.43675 10.80420 11.64395 19.80494   100

Lambrequin answered 10/6, 2024 at 14:0
