I have a data frame with dimensions of 2,377,426 rows by 2 columns, which looks something like this:
Name Seq
428293 ENSE00001892940:ENSE00001929862 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
431857 ENSE00001892940:ENSE00001883352 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGTAAATGAGCTGATGGAAGAGC
432253 ENSE00001892940:ENSE00003623668 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGTAAATGAGCTGATGGAAGAGC
436213 ENSE00001892940:ENSE00003534967 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGTAAATGAGCTGATGGAAGAGC
429778 ENSE00001892940:ENSE00002409454 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGCTGGGAACCTTTGCTCAAAGCTCC
431263 ENSE00001892940:ENSE00001834214 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGCTGGGAACCTTTGCTCAAAGCTCC
All the values in the first column (Name) are unique, but there are many duplicates in the column 'Seq'. I want a data.frame that contains only the unique sequences and one name for each. I have tried unique, but it is too slow. I have also tried sorting the data frame and using the following code:
dat_sorted = data[order(data$Seq), ]  # sort so duplicate sequences are adjacent
m = dat_sorted[1, ]
x = 1
for (i in 1:nrow(dat_sorted)) {
  if (dat_sorted[i, 2] != m[x, 2]) { x = x + 1; m[x, ] = dat_sorted[i, ] }
}
Again, this is too slow! Is there a faster way to find the unique values of one column of a data frame?
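For what it's worth, the row-by-row loop above can be replaced by a single vectorized call to duplicated(), which keeps the first Name seen for each Seq without copying rows one at a time. A minimal sketch on toy data (the column names match the question; the values are made up for illustration):

```r
# Toy data frame mimicking the question's structure: Name unique, Seq duplicated
data <- data.frame(
  Name = c("E1:A", "E1:B", "E1:C", "E1:D"),
  Seq  = c("AAAA", "AAAA", "GGCC", "GGCC"),
  stringsAsFactors = FALSE
)

# duplicated() is vectorized: one pass over Seq, TRUE for every repeat
dedup <- data[!duplicated(data$Seq), ]

stopifnot(nrow(dedup) == 2)            # one row per unique sequence
stopifnot(!any(duplicated(dedup$Seq))) # no duplicate sequences remain
```

Because duplicated() does a single hashed pass over the column, this avoids the quadratic cost of growing m row by row inside the loop.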
Have you tried the ?unique function in R? Also check out ?duplicated. – Barbary
unique – Melanymelaphyre
unique should be very efficient; you could try distinct from dplyr, or data.table's unique, as in library(data.table); unique(setDT(data), by = "Seq"). Or setDT(data)[!duplicated(Seq)]. – Melanymelaphyre
data %>% group_by(Seq) %>% distinct(). Also see this similar question #27255565 – Barbary
I compared unique and dplyr::distinct on a data frame of ~3.1 million rows, and distinct was much, much faster - many tens of seconds versus a fraction of 1. – Shows
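The data.table and dplyr suggestions from the comments can be sketched as follows (this assumes the data.table and dplyr packages are installed; as.data.table is used instead of setDT so the original data frame is left untouched):

```r
library(data.table)  # provides as.data.table() and unique() with a `by` argument
library(dplyr)       # provides distinct() and %>%

# Small example with the same two-column layout as the question
data <- data.frame(
  Name = c("E1:A", "E1:B", "E1:C"),
  Seq  = c("AAAA", "AAAA", "GGCC"),
  stringsAsFactors = FALSE
)

# data.table: keep the first row for each distinct Seq
dt <- unique(as.data.table(data), by = "Seq")

# dplyr: .keep_all = TRUE keeps the Name column alongside the distinct Seq
dp <- data %>% distinct(Seq, .keep_all = TRUE)

stopifnot(nrow(dt) == 2, nrow(dp) == 2)
```

Both keep the first occurrence of each sequence, matching what the loop in the question was doing on the sorted data.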