Using a data.table anti-join with rowid is fast and preserves order:
library(data.table)
data.table(a, rowid(a))[!data.table(b, rowid(b)), on = .(a = b, V2)][[1]]
#> [1] "B" "C"
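To see why the join works, here is a minimal sketch on hypothetical toy vectors u and v (these are illustration values, not the a and b from the question). rowid numbers each repeated value by occurrence, so the anti-join removes the first occurrences of a value in u, one for each time it appears in v:

```r
library(data.table)

# Hypothetical toy inputs, chosen only for illustration
u <- c("A", "B", "A", "C", "A")
v <- c("A", "A")

# rowid() gives the occurrence number of each value within the vector
occ <- rowid(u)
occ
#> [1] 1 1 2 1 3

# Pair each value with its occurrence number, then drop every
# (value, occurrence) pair that also exists for v.
# ("A", 1) and ("A", 2) are removed; ("A", 3) survives.
res <- data.table(u, rowid(u))[!data.table(v, rowid(v)), on = .(u = v, V2)][[1]]
res
#> [1] "B" "C" "A"
```

The surviving elements keep their original relative order in u, which is what "preserves order" means here.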
A collapse solution that is even faster but does not preserve order:
library(collapse)
rep(names(x <- pmax(fsum(rep(c(1, -1), c(length(a), length(b))), c(a, b)), 0)), x)
#> [1] "B" "C"
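The one-liner above works on counts rather than positions: every element of a contributes +1, every element of b contributes -1, fsum adds these up per distinct value, pmax clamps negatives to zero, and rep expands each value by its remaining count. A base R sketch of the same idea, on hypothetical toy vectors u and v and using tabulate in place of fsum (a substitution for illustration, not the collapse implementation):

```r
# Hypothetical toy inputs, chosen only for illustration
u <- c("A", "B", "A", "C", "A")
v <- c("A", "A")

# Distinct values across both vectors
lev <- unique(c(u, v))

# Count how often each distinct value occurs in u and in v
n_u <- tabulate(match(u, lev), nbins = length(lev))
n_v <- tabulate(match(v, lev), nbins = length(lev))

# Remaining count per value, never below zero
keep <- pmax(n_u - n_v, 0L)

# Expand each value by its remaining count; the result is grouped
# by value, so the original order of u is lost
res <- rep(lev, keep)
res
#> [1] "A" "B" "C"
```

The result holds the same elements as the anti-join would return for these inputs, but grouped by value instead of in their original positions.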
Testing it on larger vectors:
set.seed(2041082007)
a <- stringi::stri_rand_strings(2e5, 2)
b <- sample(a, 1e5)
microbenchmark::microbenchmark(
data.table = data.table(a, rowid(a))[!data.table(b, rowid(b)), on = .(a = b, V2)][[1]],
collapse = rep(names(x <- pmax(fsum(rep(c(1, -1), c(length(a), length(b))), c(a, b)), 0)), x)
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> data.table 19.130200 20.470851 22.632482 22.279351 23.491751 64.2524 100
#> collapse 4.622602 5.939251 6.959331 6.490801 7.773651 12.1256 100
Compare to the pmatch solution from this answer, which takes about 46 seconds on the same data:
system.time(ab2 <- a[-pmatch(b, a, 0)])
#> user system elapsed
#> 46.53 0.00 46.56
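Additionally, pmatch does not seem to behave correctly for this problem. Comparing it with the anti-join result, stored as ab1:
ab1 <- data.table(a, rowid(a))[!data.table(b, rowid(b)), on = .(a = b, V2)][[1]]
all.equal(ab1, ab2)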
#> [1] "Lengths (100000, 196156) differ (string compare on first 100000)"
#> [2] "99979 string mismatches"
pmatch is returning a much larger vector than expected. Get the difference between the two answers:
ab12 <- data.table(ab2, rowid(ab2))[!data.table(ab1, rowid(ab1)), on = .(ab2 = ab1, V2)][[1]]
Check what is happening with the first element of ab12.
ab12[1]
#> [1] "28"
sum(a == ab12[1])
#> [1] 57
sum(b == ab12[1])
#> [1] 45
"28" appears 57 times in a and 45 times in b, so the result should have 57 - 45 = 12 instances of "28", as was returned by the anti-join.
sum(ab1 == ab12[1])
#> [1] 12
The pmatch solution, however, erroneously returns a vector that has 56 instances of "28".
sum(ab2 == ab12[1])
#> [1] 56
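The same count arithmetic can be applied to every value at once rather than to a single element. The sketch below uses a hypothetical helper, check_counts (not part of any package), that returns TRUE when each value appears in the result exactly max(count in a - count in b, 0) times. It is shown on toy vectors, so no claim is made here about its output on the large a and b above:

```r
# Hypothetical helper: TRUE if `res` contains every value exactly
# max(count in a - count in b, 0) times
check_counts <- function(a, b, res) {
  lev <- unique(c(a, b, res))
  cnt <- function(z) tabulate(match(z, lev), nbins = length(lev))
  all(cnt(res) == pmax(cnt(a) - cnt(b), 0L))
}

# Hypothetical toy inputs, chosen only for illustration
u <- c("A", "B", "A", "C", "A")
v <- c("A", "A")

ok  <- check_counts(u, v, c("B", "C", "A"))       # correct multiset difference
bad <- check_counts(u, v, c("B", "C", "A", "A"))  # one "A" too many
ok
#> [1] TRUE
bad
#> [1] FALSE
```

Calling check_counts(a, b, ab1) and check_counts(a, b, ab2) would apply the same test to the two candidate answers.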