Following up from my question here, I am trying to replicate in R the functionality of the Stata command duplicates tag, which tags all rows of a dataset that are duplicates in terms of a given set of variables:
clear *
set obs 16
g f1 = _n
expand 104
bys f1: g f2 = _n
expand 2
bys f1 f2: g f3 = _n
expand 41
bys f1 f2 f3: g f4 = _n
des // describe the dataset in memory
preserve
sample 10 // draw a 10% random sample
tempfile sampledata
save `sampledata', replace
restore
// append the duplicate rows to the data
append using `sampledata'
sort f1-f4
duplicates tag f1-f4, generate(dupvar)
browse if dupvar == 1 // check that all duplicate rows have been tagged
Edit
Here is what Stata produces (added at @Arun's request):
f1  f2  f3  f4  dupvar
 1   1   1   1       0
 1   1   1   2       0
 1   1   1   3       1
 1   1   1   3       1
 1   1   1   4       0
 1   1   1   5       0
 1   1   1   6       0
 1   1   1   7       0
 1   1   1   8       1
 1   1   1   8       1
Note that for (f1, f2, f3, f4) = (1, 1, 1, 3) there are two rows, and both of them are marked dupvar = 1. Similarly, both of the rows that are duplicates for (f1, f2, f3, f4) = (1, 1, 1, 8) are tagged.
R:
The base function duplicated flags only the second and later occurrences of each duplicated row, never the first one.
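A minimal illustration of that behavior:

x = c(1, 1, 2)
duplicated(x)
# [1] FALSE  TRUE FALSE  (the first copy is not flagged)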
So I wrote a function to replicate the Stata functionality in R, using ddply from the plyr package.
# Values of (f1, f2, f3, f4) uniquely identify observations
dfUnique = expand.grid(f1 = factor(1:16),
                       f2 = factor(1:41),
                       f3 = factor(1:2),
                       f4 = factor(1:104))
# sample some extra rows and rbind them
dfDup = rbind(dfUnique, dfUnique[sample(1:nrow(dfUnique), 100), ])
# dummy data
dfDup$data = rnorm(nrow(dfDup))
library(plyr)  # for ddply

# function: use ddply to tag all duplicate rows in the data
fnDupTag = function(dfX, indexVars) {
  dfDupTag = ddply(dfX, .variables = indexVars, .fun = function(x) {
    if (nrow(x) > 1) x$dup = 1 else x$dup = 0
    return(x)
  })
  return(dfDupTag)
}
# test the function
indexVars = paste0('f', 1:4)  # paste0 already uses an empty separator; no sep argument needed
dfTemp = fnDupTag(dfDup, indexVars)
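As a quick sanity check: sample() draws without replacement by default, so each of the 100 sampled rows appears exactly twice in dfDup, and exactly 200 rows should be tagged:

table(dfTemp$dup)
#      0      1
# 136348    200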
But as in the linked question, performance is a huge issue. Another possible solution is:
dfDup$dup = duplicated(dfDup[, indexVars]) |
  duplicated(dfDup[, indexVars], fromLast = TRUE)
# sort on all index variables; do.call(order, ...) passes every column to order(),
# whereas order(eval(parse(text = indexVars))) would evaluate only the last name
dfDupSorted = dfDup[do.call(order, dfDup[, indexVars]), ]
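To see whether the duplicated() version also handles more than two copies, here is a quick sketch on a throwaway data frame with one row repeated three times:

toy = data.frame(a = c(1, 1, 1, 2), b = c(1, 1, 1, 1))
duplicated(toy) | duplicated(toy, fromLast = TRUE)
# [1]  TRUE  TRUE  TRUE FALSE  (all three copies are flagged)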
I have a few questions:
1. Is it possible to make the ddply version faster?
2. Is the second version using duplicated correct, even when there are more than two copies of a duplicated row?
3. How would I do this using data.table, and would that be faster? (A tentative sketch of what I have in mind follows.)
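For question 3, the kind of data.table code I have in mind is sketched below (assuming a current version of data.table; .N is the number of rows in each by-group):

library(data.table)
dtDup = as.data.table(dfDup)
# Stata's duplicates tag stores (number of copies - 1), so .N - 1L matches it;
# as.integer(.N > 1) would give a plain 0/1 flag instead
dtDup[, dupvar := .N - 1L, by = indexVars]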
You can get dupvar just with ave(1:nrow(dd), dd[, 1:4], FUN = function(x) length(x) > 1). – Monacid
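Applied to the example data above, that comment's suggestion would read roughly as follows (a sketch; ave() applies FUN within each group formed by the index columns and recycles the result over the group's rows, with the logical coerced to 0/1; building the grouping factor over this many groups may be slow):

dfDup$dupvar = ave(seq_len(nrow(dfDup)), dfDup[, indexVars],
                   FUN = function(x) length(x) > 1)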