How to speed up a 'unique' dataframe search
I have a data frame with 2,377,426 rows and 2 columns, which looks something like this:

                   Name                                            Seq
428293 ENSE00001892940:ENSE00001929862 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
431857 ENSE00001892940:ENSE00001883352 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGTAAATGAGCTGATGGAAGAGC
432253 ENSE00001892940:ENSE00003623668 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGTAAATGAGCTGATGGAAGAGC
436213 ENSE00001892940:ENSE00003534967 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGTAAATGAGCTGATGGAAGAGC
429778 ENSE00001892940:ENSE00002409454 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGCTGGGAACCTTTGCTCAAAGCTCC
431263 ENSE00001892940:ENSE00001834214 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGCTGGGAACCTTTGCTCAAAGCTCC

All the values in the first column (Name) are unique, but there are many duplicates in the 'Seq' column. I want a data.frame that contains only unique sequences, each with a name. I have tried unique, but it is too slow. I have also tried sorting the data frame and using the following code:

# sort by Seq so duplicates are adjacent, then keep the first row of each run
dat_sorted <- data[order(data$Seq), ]
m <- dat_sorted[1, ]
x <- 1
for (i in 1:nrow(dat_sorted)) {
  if (dat_sorted[i, 2] != m[x, 2]) { x <- x + 1; m[x, ] <- dat_sorted[i, ] }
}

Again, this is too slow! Is there a faster way to find the unique values in one column of a data frame?

Cherokee answered 3/12, 2014 at 9:4 Comment(5)
Did you know that there is a ?unique function in R? Also check out ?duplicated. – Barbary
@beginneR, I think he mentioned he tried unique. – Melanymelaphyre
unique should be very efficient. You could try distinct from dplyr, or data.table's unique, as in library(data.table); unique(setDT(data), by = "Seq"), or setDT(data)[!duplicated(Seq)] (see the sketch after these comments). – Melanymelaphyre
the dplyr version would be data %>% group_by(Seq) %>% distinct(). Also see this similar question #27255565. – Barbary
For what it's worth, I've just compared unique and dplyr::distinct on a data frame of ~3.1 million rows, and distinct was much, much faster: many tens of seconds versus a fraction of one second. – Shows
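
For reference, a minimal sketch of the data.table route suggested in the comments above, assuming the data frame is named data with a Seq column as in the question:

library(data.table)

# convert to a data.table by reference, then keep the first row per distinct Seq
setDT(data)
unique_data <- unique(data, by = "Seq")

unique here dispatches to data.table's method, which is typically much faster on large tables than the base data.frame method.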
data[!duplicated(data$Seq), ]

should do the trick.
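
For illustration, a tiny hypothetical example: duplicated marks the second and later occurrences of each Seq, so negating it keeps the first occurrence of each sequence, and the trailing comma means "keep all columns":

# hypothetical toy data with a repeated sequence
toy <- data.frame(Name = c("a", "b", "c"), Seq = c("AAA", "AAA", "CCC"))
toy[!duplicated(toy$Seq), ]
#   Name Seq
# 1    a AAA
# 3    c CCC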

Edaedacious answered 3/12, 2014 at 9:7 Comment(2)
What is that , at the end of the line for? What does it mean? – Carpathoukraine
!duplicated is incredibly slow; in fact, when using it for 2 columns, it takes forever even on a dataset of only 10 million rows. kit::funique is the way to go. – Steve
library(dplyr)
data %>% distinct()

This should be worth it, especially if your data is too big for your machine.
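
Note that distinct() with no arguments deduplicates on all columns. To deduplicate on the Seq column only while keeping Name (which is what the question asks for), a sketch assuming the data frame is named data:

library(dplyr)

# keep the first row for each distinct Seq, retaining all other columns
data %>% distinct(Seq, .keep_all = TRUE)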

Endoskeleton answered 17/9, 2020 at 6:12 Comment(0)

For the fastest option, you can try:

data[!kit::fduplicated(data$Seq), ]

Here are some benchmarks taken directly from the documentation:

library(kit)
x = sample(c(1:10, NA_integer_), 1e8, TRUE) # 382 Mb
microbenchmark::microbenchmark(
  duplicated(x),
  fduplicated(x),
  times = 5L
)
# Unit: seconds
#           expr  min   lq  mean  median   uq   max neval
# duplicated(x)  2.21 2.21  2.48    2.21 2.22  3.55     5
# fduplicated(x) 0.38 0.39  0.45    0.48 0.49  0.50     5

kit also has a funique function.
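
For completeness, a sketch of how these might be applied to the question's data frame (assuming it is named data):

library(kit)

# unique sequences only (drops the Name column)
unique_seqs <- funique(data$Seq)

# unique sequences together with their first associated Name, as in the answer above
unique_data <- data[!fduplicated(data$Seq), ]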

Tamera answered 25/2, 2021 at 12:17 Comment(1)
kit::funique is the way to go; !duplicated is incredibly slow (even for non-lists and atomic vectors). – Steve

kit::fduplicated seems to have a slight advantage on data frames with many unique rows (few repetitions), while dplyr::distinct seems to be slightly more efficient on data frames with many repeated rows (few unique rows):

# Make this example reproducible
set.seed(1)
n_samples <- 1e7

# Many unique rows case: Create a data frame with random integers between 1 and 1000
df <- as.data.frame(matrix(round(runif(n=n_samples, min=1, max=1000), 0), nrow=n_samples/2))
names(df) <- c('A', 'B')

microbenchmark::microbenchmark(
  un_1 <- df[!base::duplicated(df), ],
  un_2 <- df[!kit::fduplicated(df), ],
  un_3 <- dplyr::distinct(df),
  times = 5L
)

# Unit: milliseconds
#                                expr       min         lq       mean     median         uq        max neval
# un_1 <- df[!base::duplicated(df), ] 9817.6096 10173.5799 10721.0293 10772.2749 11073.4896 11768.1927     5
# un_2 <- df[!kit::fduplicated(df), ]  558.9923   618.1214   673.6863   628.9305   671.2307   891.1565     5
#         un_3 <- dplyr::distinct(df)  596.9396   640.1986   680.0212   643.6371   674.5296   844.8010     5


# Many repeated rows case: Create a data frame with random integers between 1 and 10
df <- as.data.frame(matrix(round(runif(n=n_samples, min=1, max=10), 0), nrow=n_samples/2))
names(df) <- c('A', 'B')

microbenchmark::microbenchmark(
  un_1 <- df[!base::duplicated(df), ],
  un_2 <- df[!kit::fduplicated(df), ],
  un_3 <- dplyr::distinct(df),
  times = 5L
)

#Unit: milliseconds
#                                 expr       min        lq     mean    median        uq       max neval
#  un_1 <- df[!base::duplicated(df), ] 8282.4409 8439.2752 8550.715 8457.0352 8704.7729 8870.0511     5
#  un_2 <- df[!kit::fduplicated(df), ]  130.8126  136.0880  244.323  168.6322  221.6255  564.4568     5
#          un_3 <- dplyr::distinct(df)  148.4684  160.8196  162.815  165.0068  169.5027  170.2775     5
Cirrocumulus answered 21/3, 2023 at 11:42 Comment(0)

My version, for multiple columns. It is the fastest way I have found.

library(magrittr)  # provides the %>% pipe

uniq_df <- df %>%
  collapse::funique(cols = c('col1', 'col2'))
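
Adapted to the question's data (assuming the data frame is named data, as elsewhere on this page), deduplicating on the single Seq column might look like:

library(collapse)

# keep the first row for each distinct Seq
uniq_data <- funique(data, cols = "Seq")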
Felon answered 30/12, 2023 at 8:47 Comment(0)
