Find duplicated rows (based on 2 columns) in Data Frame in R

I have a data frame in R which looks like:

| RIC    | Date                | Open   |
|--------|---------------------|--------|
| S1A.PA | 2011-06-30 20:00:00 | 23.7   |
| ABC.PA | 2011-07-03 20:00:00 | 24.31  |
| EFG.PA | 2011-07-04 20:00:00 | 24.495 |
| S1A.PA | 2011-07-05 20:00:00 | 24.23  |

I want to know whether there are any duplicates with respect to the combination of RIC and Date. Is there a function for that in R?

Voiced answered 8/8, 2011 at 18:19 Comment(0)
85

You can always try simply passing those first two columns to the function duplicated:

duplicated(dat[,1:2])

assuming your data frame is called dat. For more information, consult the help file for duplicated by typing ?duplicated at the console, which includes the following:

Determines which elements of a vector or data frame are duplicates of elements with smaller subscripts, and returns a logical vector indicating which elements (rows) are duplicates.

So duplicated returns a logical vector, which we can then use to extract a subset of dat:

ind <- duplicated(dat[,1:2])
dat[ind,]

or you can skip the separate assignment step and simply use:

dat[duplicated(dat[,1:2]),]
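
Note that duplicated flags only the second and later occurrences of each combination. If you also want the first row of every duplicated pair, a common base-R idiom is a second pass with fromLast = TRUE; a minimal sketch, again assuming the data frame is called dat:

# flag every member of a duplicated RIC/Date group, not just the later rows
ind_all <- duplicated(dat[,1:2]) | duplicated(dat[,1:2], fromLast = TRUE)
dat[ind_all,]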
Antabuse answered 8/8, 2011 at 18:23 Comment(5)
How can I retrieve the duplicated rows? I don't know how the result of the duplicated function is indexed.Voiced
@user802231 - Edited to address your further query.Antabuse
I tried, but the result doesn't seem correct. What I get is shown below (the number in front of each line is the row name): RIC Date 107515 7541.T 2011-06-30 20:00:00 107516 7541.T 2011-07-03 20:00:00 107517 7541.T 2011-07-04 20:00:00 107518 7541.T 2011-07-05 20:00:00 107519 7541.T 2011-07-06 20:00:00 107520 7541.T 2011-07-07 20:00:00 107521 7541.T 2011-07-10 20:00:00 107522 7541.T 2011-07-11 20:00:00 107523 7541.T 2011-07-12 20:00:00 107524 7541.T 2011-07-13 20:00:00 107525 7541.T 2011-07-14 20:00:00 107526 7541.T 2011-07-18 20:00:00Voiced
@user802231 What's the problem?Antabuse
Watch out with this solution! It will only return TRUE for exactly the same combination of columns 1 and 2, not if the values are swapped. In other words: A,B (the values of column 1 and column 2, respectively) will be flagged as a duplicate if there's another A,B, but not if there's a B,A.Parttime
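
As the last comment warns, duplicated is order-sensitive. If the two key columns are interchangeable and A,B should match B,A, one workaround is to sort each pair within its row before checking; a minimal sketch, with hypothetical interchangeable columns col1 and col2:

k1 <- pmin(dat$col1, dat$col2)  # row-wise smaller of the pair
k2 <- pmax(dat$col1, dat$col2)  # row-wise larger of the pair
duplicated(cbind(k1, k2))       # A,B and B,A now collapse to the same key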
30

dplyr is so much nicer for this sort of thing:

library(dplyr)
yourDataFrame %>%
    distinct(RIC, Date, .keep_all = TRUE)

(the ".keep_all is optional. if not used, it will return only the deduped 2 columns. when used, it returns the deduped whole data frame)

Enate answered 17/7, 2017 at 15:36 Comment(2)
How would you do it if you just want to know whether duplicate values exist?Sechrist
While this is a useful trick in general, it doesn't answer the question the OP posted, which is how one identifies duplicate observations.Carotid
22

Here's a dplyr option for tagging duplicates based on two (or more) columns, in this case ric and date:

library(dplyr)

df <- tibble(ric = c('S1A.PA', 'ABC.PA', 'EFG.PA', 'S1A.PA', 'ABC.PA', 'EFG.PA'),
             date = c('2011-06-30 20:00:00', '2011-07-03 20:00:00', '2011-07-04 20:00:00', '2011-07-05 20:00:00', '2011-07-03 20:00:00', '2011-07-04 20:00:00'),
             open = c(23.7, 24.31, 24.495, 24.23, 24.31, 24.495))

df %>% 
  group_by(ric, date) %>% 
  mutate(dupe = n()>1)
# A tibble: 6 x 4
# Groups:   ric, date [4]
  ric    date                 open dupe 
  <chr>  <chr>               <dbl> <lgl>
1 S1A.PA 2011-06-30 20:00:00  23.7 FALSE
2 ABC.PA 2011-07-03 20:00:00  24.3 TRUE 
3 EFG.PA 2011-07-04 20:00:00  24.5 TRUE 
4 S1A.PA 2011-07-05 20:00:00  24.2 FALSE
5 ABC.PA 2011-07-03 20:00:00  24.3 TRUE 
6 EFG.PA 2011-07-04 20:00:00  24.5 TRUE 
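
To extract only the offending rows rather than tag them, the same grouped logic works with filter; a minimal sketch on the df above:

df %>% 
  group_by(ric, date) %>% 
  filter(n() > 1) %>%   # keep only combinations that occur more than once
  ungroup()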
Bootblack answered 19/3, 2019 at 1:1 Comment(0)
10

An easy way to get the information you want is to use dplyr.

library(dplyr)

yourDF %>% 
  group_by(RIC, Date) %>% 
  mutate(num_dups = n(), 
         dup_id = row_number()) %>% 
  ungroup() %>% 
  mutate(is_duplicated = dup_id > 1)
# A tibble: 6 × 6
  RIC    Date                 Open num_dups dup_id is_duplicated
  <chr>  <chr>               <dbl>    <int>  <int> <lgl>        
1 S1A.PA 2011-06-30 20:00:00  23.7        1      1 FALSE        
2 ABC.PA 2011-07-03 20:00:00  24.3        2      1 FALSE        
3 EFG.PA 2011-07-04 20:00:00  24.5        2      1 FALSE        
4 S1A.PA 2011-07-05 20:00:00  24.2        1      1 FALSE        
5 ABC.PA 2011-07-03 20:00:00  24.3        2      2 TRUE         
6 EFG.PA 2011-07-04 20:00:00  24.5        2      2 TRUE  

Using this:

  • num_dups tells you how many times that particular combination occurs
  • dup_id tells you which occurrence that particular row is (e.g. 1st, 2nd, or 3rd, etc.)
  • is_duplicated gives you an easy condition you can filter on later to remove all the duplicate rows (e.g. filter(!is_duplicated)), though you could also use dup_id for this (e.g. filter(dup_id == 1)); see the sketch below
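
Putting the flags to work, a minimal dedup sketch that keeps only the first occurrence of each RIC/Date pair:

yourDF %>% 
  group_by(RIC, Date) %>% 
  mutate(dup_id = row_number()) %>% 
  ungroup() %>% 
  filter(dup_id == 1) %>%   # drop every occurrence after the first
  select(-dup_id)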
Alcyone answered 27/6, 2020 at 23:8 Comment(0)
5

If you want to remove duplicate records based on the values of the columns Date and State in the data frame dataset:

#Indexes of the duplicate rows that will be removed:
duplicate_indexes <- which(duplicated(dataset[c('Date', 'State')]))
duplicate_indexes

#new_uniq will contain the unique dataset, without the duplicates.
new_uniq <- dataset[!duplicated(dataset[c('Date', 'State')]),]
View(new_uniq)
Com answered 1/11, 2017 at 7:25 Comment(1)
I tried dataset[c('Date', 'State')] and got an error; dataset[, c('Date', 'State')] works.Maritamaritain
2

I think what you're looking for is a way to return a data frame of the duplicated rows in the same format as your original data. There is probably a more elegant way to do this, but this works:

dup <- data.frame(dup = as.numeric(duplicated(df[, c("RIC", "Date")])))  #binary flag for duplicated rows
df2 <- cbind(df, dup)                                                    #bind flag to original df
df3 <- subset(df2, dup == 1)                                             #subset df using the binary flag
Vasilikivasilis answered 20/4, 2012 at 18:55 Comment(0)
1

Found quite a masterful idea posted by Steve Lianouglou that helps solve this problem, with the great advantage of indexing the repetitions:

If you generate a hash column by concatenating the columns you want to check for duplicates, you can then use dplyr::n() together with seq() to give an index to each duplicate occurrence, as follows:

library(dplyr)
library(stringr)

dat %>% 
  mutate(hash = str_c(RIC, Date)) %>%
  group_by(hash) %>% 
  mutate(duplication_id = seq(n())) %>%
  ungroup()

The duplication_id column indexes each occurrence of a RIC/Date combination in order, so any row with a value greater than 1 duplicates a row above it (same values in both columns). I used this to remove duplicate IDs.
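
For completeness, removing the duplicates with that index is then a single filter; a minimal sketch building on the pipeline above:

dat %>% 
  mutate(hash = str_c(RIC, Date)) %>%
  group_by(hash) %>% 
  mutate(duplication_id = seq(n())) %>%
  ungroup() %>%
  filter(duplication_id == 1) %>%   # keep the first occurrence of each hash
  select(-hash, -duplication_id)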

Incident answered 29/3, 2022 at 0:13 Comment(0)
0

An expression like df[df[, c('key1', 'key2')] |> duplicated() |> which(), ] shows only the surplus rows of each duplicated group (every occurrence after the first), not the full groups.

You can use this to filter the rows whose key(s) appear more than once:

library(magrittr)

#' @name check_duprows
#' @description 
#' 
#' Check duplicated rows by key(s) in df.
#' 
#' @example 
#' `df %>% check_duprows(key1, key2, ...)`
#' 
#' @references 
#' - main: [ans-62616469](https://mcmap.net/q/329135/-find-duplicated-rows-based-on-2-columns-in-data-frame-in-r/62616469#62616469)
#' - select except: [ans-49515461](https://mcmap.net/q/331556/-dplyr-select-all-variables-except-for-those-contained-in-vector/49515461#49515461)
#' - sort/order/arrange: [ans-6871968](https://mcmap.net/q/45220/-sort-order-data-frame-rows-by-multiple-columns/6871968#6871968)
#' 
check_duprows <- function(df, ..., .show_all = FALSE) df %>%
    dplyr::group_by(...) %>%
    dplyr::mutate(
        .dup_count = dplyr::n(),
        .dup_rownum = dplyr::row_number()) %>%
    dplyr::ungroup() %>%
    dplyr::mutate(
        .is_duplicated = .dup_rownum > 1,
        .has_duplicated = .dup_count > 1) %>%
    (\(tb) if (.show_all) tb else tb %>%
        dplyr::filter(.has_duplicated) %>%
        dplyr::select(-tidyselect::one_of('.has_duplicated'))) %>%
    dplyr::arrange(...)

Then just use it like:

df %>% check_duprows(key1, key2, ...)

Such as:

df <- data.frame(
    RIC = c(
        'S1A.PA', 'ABC.PA', 'EFG.PA', 
        'S1A.PA', 'ABC.PA', 'EFG.PA'), 
    Date = c(
        '2011-06-30 20:00:00', 
        '2011-07-03 20:00:00', 
        '2011-07-04 20:00:00', 
        '2011-07-05 20:00:00', 
        '2011-07-03 20:00:00', 
        '2011-07-04 20:00:00'), 
    Open = runif(n = 6, min = 20, max = 30)
)

df %>% check_duprows(RIC, Date)

And you can also define a de-duplicator on top of this function:

unique_duprows <- function(df, ...) df %>%
    check_duprows(..., .show_all = TRUE) %>%
    dplyr::filter(!.is_duplicated) %>%
    dplyr::select(-tidyselect::one_of(
        '.has_duplicated', 
        '.is_duplicated', 
        '.dup_count', 
        '.dup_rownum'))

df %>% dplyr::arrange(Open) %>% unique_duprows(RIC, Date)

It's just like a distinct() function!

Demo on webr and shinylive.

Maritamaritain answered 10/4 at 9:27 Comment(0)
