Removing elements from a vector as many times as they occur in another vector in R

I want to remove elements from a vector as many times as they occur in another vector, as if I were subtracting one vector from the other. Every element in the vector of elements I want to remove also exists in the main vector I want to remove from.

a <- c("A", "B", "B", "C", "C", "C")
b <- c("A", "B", "C", "C")

a[! a %in% b] #returns character(0)

#expected result = "B" "C"

I don't want to use a library for this. I'd rather write a function, if possible without loops. Is there a way to do so? Thank you in advance.

Vinasse answered 22/3, 2023 at 23:11 Comment(1)
Very possibly a duplicate of "Set Difference" between two vectors with duplicate values. - Contessacontest

This may not be the most efficient, but

Reduce(function(prev, this) {
  ind <- match(this, prev)
  # match() returns NA (not a zero-length vector) when `this` is absent from `prev`
  if (!is.na(ind)) prev[-ind] else prev
}, b, init = a)
# [1] "B" "C"
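To see what the Reduce call does at each step, the intermediate states can be kept with accumulate = TRUE. This is only an illustrative sketch: the helper name drop_first and the variable steps are made up here and are not part of the original answer.

```r
a <- c("A", "B", "B", "C", "C", "C")
b <- c("A", "B", "C", "C")

# Hypothetical helper: drop the first occurrence of `this` from `prev`,
# or return `prev` unchanged when `this` is not present
drop_first <- function(prev, this) {
  ind <- match(this, prev)
  if (is.na(ind)) prev else prev[-ind]
}

# One list entry per state: the initial `a`, then `a` after each element of `b`
steps <- Reduce(drop_first, b, init = a, accumulate = TRUE)

steps[[2]]             # after removing the first "A"
steps[[length(steps)]] # final state, same as the non-accumulating call
```

Each element of b removes exactly one matching element from the running vector, which is why repeated values in b remove repeated values in a.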

For fun, here's a non-Reduce variant (motivated by looking at AllanCameron's simpler answer) that preserves order. The added complexity is only worth it if preserving order is necessary.

finddiff2 <- function(A, B) {
  dict <- split(seq_along(A), A)
  tb <- table(B)
  nms <- intersect(names(tb), A)
  dict[nms] <- Map(tail, dict[nms], -tb[nms])
  A[sort(unlist(dict))]
}
finddiff2(a, b)
# [1] "B" "C"
finddiff2(rev(a), b)
# [1] "C" "B"
finddiff2(c("A","B"), "A")
# [1] "B"

The preservation is easier to see with a longer a:

a <- rep(c("A","B","C"), times = 4)
finddiff2(a, b)
# [1] "A" "B" "A" "B" "C" "A" "B" "C"
finddiff2(rev(a), b)
# [1] "B" "A" "C" "B" "A" "C" "B" "A"
Lawerencelawes answered 22/3, 2023 at 23:26 Comment(5)
Ah, I get what you mean about the preserved ordering now. Thanks for the addition. - Lour
Try: finddiff2(c("A", "B"), "A") - Poff
Good find, fixed @Poff. - Lawerencelawes
Maybe A[sort(unlist(dict))] instead of rep(names(dict), lengths(dict))[order(unlist(dict))]? This also keeps the type (e.g. it will also work with integer and will not convert to character). - Poff
@GKi, that's a great point and recommendation, thanks. - Lawerencelawes

In base R you could use pmatch:

a[-pmatch(b, a, 0)]
[1] "B" "C"

Note that the 0 above (the nomatch argument) is needed in case b contains a value that does not exist in a.

If all the elements in b are in a, then the following is sufficient:

a[-pmatch(b, a)]
[1] "B" "C"

NB: as @jblood pointed out, pmatch only gives the correct result here when the vectors have at most 100 elements; with longer vectors the result can be wrong (see the comments below).

Viscose answered 23/3, 2023 at 1:3 Comment(4)
This should come with a strong caveat of length(a) < 101 | length(b) < 101. Otherwise the result will be incorrect. Compare pmatch(rep("a", 100), rep("a", 100)) to pmatch(rep("a", 101), rep("a", 101)). - Thickknee
@Thickknee So far I do not know why that is the case, though the behaviour is quite striking. The only thing I have found so far is the notion that the target is not allowed to be long. - Viscose
Line 1603: github.com/wch/r-source/blob/… - Thickknee
@Thickknee That's right, I missed that. It shows that both need to be less than 100, otherwise pmatch would just change to charmatch and not allow multiple exact matches. - Viscose

If you want to define a simple function, you could do:

finddiff <- function(a, b) {
  levs <- unique(c(a, b))
  tab  <- table(factor(a, levs)) - table(factor(b, levs))
  tab  <- abs(tab[tab != 0])
  rep(names(tab), tab)
}

finddiff(a, b)
#> [1] "B" "C"
Lour answered 22/3, 2023 at 23:31 Comment(6)
I thought about subtracting tables like that, nice approach. The only advantage Reduce has is that it preserves order, which I feared (without verification) that table(.) - table(.) would not. Nice use of factor in there, btw. - Lawerencelawes
Thanks @r2evans. The ordering in tables is based on factor levels, so we are guaranteed to get matching tables as long as we specify that a and b are factors with the same levels harvested from the unique values of both vectors. I never thought to use Reduce, but it's a neat idea too. - Lour
finddiff doesn't preserve order when the letters are not sorted, but the OP never stated that as a requirement; I assumed it (for the challenge). - Lawerencelawes
Maybe tab[tab > 0] instead of abs(tab[tab != 0])? - Poff
@Poff I guess it depends on what you are trying to extract. tab[tab > 0] would get all elements of a not in b, i.e. A - B, but abs(tab[tab != 0]) gets all elements of a and b that are not part of the intersection, i.e. A∪B - A∩B. Both are the same in this example of course, so the OP's aim was open to interpretation. - Lour
I would interpret "I want to remove the elements" as A - B. But yes, given "every element in my vector of elements I want to remove is also existing in the main vector i want to remove from", it will not matter. - Poff
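The difference between the two readings discussed in these comments only shows up when b contains values that a lacks (or contains them more often). A small sketch with made-up input vectors, where one_sided and two_sided are illustrative names rather than anything from the answer:

```r
# Made-up inputs in which b has elements that a does not cover
a <- c("A", "B", "B", "C")
b <- c("A", "C", "C", "D")

levs <- unique(c(a, b))
# Per-level count difference: positive means a surplus in a, negative a surplus in b
n <- as.integer(table(factor(a, levs)) - table(factor(b, levs)))

# A - B: keep only what is left over in a
one_sided <- rep(levs, pmax(n, 0L))

# A∪B - A∩B: keep leftovers from either side
two_sided <- rep(levs, abs(n))

one_sided
two_sided
```

When every element of b is guaranteed to exist in a at least as often, as in the question, no count difference is negative and both variants return the same vector.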

Using a data.table anti-join with rowid is fast and preserves order:

library(data.table)
data.table(a, rowid(a))[!data.table(b, rowid(b)), on = .(a = b, V2)][[1]]
#> [1] "B" "C"
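The same idea (pair every value with its running occurrence number, then drop the pairs that also appear in b) can be sketched in base R without any package. The names occ_id and multiset_diff are hypothetical, and the sketch assumes the values never contain a tab character, since a tab is used to build the combined key.

```r
# Hypothetical helper: running occurrence number of each value,
# e.g. c("B", "C", "B") gives 1 1 2 (similar in spirit to data.table::rowid)
occ_id <- function(x) ave(seq_along(x), x, FUN = seq_along)

# Hypothetical helper: remove from `a` one element per matching element of `b`
multiset_diff <- function(a, b) {
  # Assumption: "\t" does not occur in the data, so the keys are unambiguous
  key_a <- paste(a, occ_id(a), sep = "\t")
  key_b <- paste(b, occ_id(b), sep = "\t")
  a[!key_a %in% key_b]
}

a <- c("A", "B", "B", "C", "C", "C")
b <- c("A", "B", "C", "C")
multiset_diff(a, b)
```

Because the result is obtained by subsetting a, the order and type of a are preserved, and the earliest occurrences are the ones removed.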

A collapse solution that is even faster but does not preserve order:

library(collapse)
rep(names(x <- pmax(fsum(rep(c(1, -1), c(length(a), length(b))), c(a, b)), 0)), x)
#> [1] "B" "C"

Testing it on larger vectors:

set.seed(2041082007)
a <- stringi::stri_rand_strings(2e5, 2)
b <- sample(a, 1e5)

microbenchmark::microbenchmark(
  data.table = data.table(a, rowid(a))[!data.table(b, rowid(b)), on = .(a = b, V2)][[1]],
  collapse = rep(names(x <- pmax(fsum(rep(c(1, -1), c(length(a), length(b))), c(a, b)), 0)), x)
)
#> Unit: milliseconds
#>        expr       min        lq      mean    median        uq     max neval
#>  data.table 19.130200 20.470851 22.632482 22.279351 23.491751 64.2524   100
#>    collapse  4.622602  5.939251  6.959331  6.490801  7.773651 12.1256   100

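Compare to the pmatch solution from the other answer (ab1 holds the anti-join result used in the comparison below):

ab1 <- data.table(a, rowid(a))[!data.table(b, rowid(b)), on = .(a = b, V2)][[1]]
system.time(ab2 <- a[-pmatch(b, a, 0)])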
#>    user  system elapsed 
#>   46.53    0.00   46.56

Additionally, pmatch does not seem to behave correctly for this problem:

all.equal(ab1, ab2)
#> [1] "Lengths (100000, 196156) differ (string compare on first 100000)"
#> [2] "99979 string mismatches"

pmatch is returning a much larger vector than expected. Get the difference between the two answers:

ab12 <- data.table(ab2, rowid(ab2))[!data.table(ab1, rowid(ab1)), on = .(ab2 = ab1, V2)][[1]]

Check what is happening with the first element of ab12.

ab12[1]
#> [1] "28"
sum(a == ab12[1])
#> [1] 57
sum(b == ab12[1])
#> [1] 45

"28" appears 57 times in a and 45 times in b, so the result should have 12 instances of "28" as was returned by the anti-join.

sum(ab1 == ab12[1])
#> [1] 12

The pmatch solution, however, erroneously returns a vector that has 56 instances of "28".

sum(ab2 == ab12[1])
#> [1] 56
Thickknee answered 23/3, 2023 at 12:15 Comment(7)
pmatch will also match parts of a string, but this should not be a problem when, as given in the question, "every element in my vector of elements I want to remove is also existing in the main vector I want to remove". - Poff
At first I was thinking it had to do with partial matching, but it doesn't seem to be the case. It seems to have to do with the vector sizes. Try set.seed(1); a <- sample(LETTERS, 1e3, 1); b <- sample(a, 5e2, 1); ab <- a[-pmatch(b, a, 0)]; sum(a == "A"); sum(b == "A"); sum(ab == "A"). - Thickknee
@GKi, on the other hand, it seems to work ok for smaller vectors: set.seed(1); a <- sample(LETTERS, 200, 1); b <- sample(a, 100, 1); ab <- a[-pmatch(b, a, 0)]; sum(a == "A"); sum(b == "A"); sum(ab == "A"). - Thickknee
Yes, you are right! pmatch(rep("a", 100), rep("a", 100)) works in my case, while pmatch(rep("a", 101), rep("a", 101)) does not. - Poff
Yep. 101 seems to be the transition point. I'm trying to find the .Internal code for pmatch, but I'm having a hard time. - Thickknee
Maybe because it "comes from" argument matching and typically there are fewer than 100 arguments...? - Poff
See line 1603 here: github.com/wch/r-source/blob/… - Thickknee
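
Subtracting the two tables and expanding the counts with tidyr::uncount also works:

# `counts` is used instead of `c` to avoid masking base::c
counts <- data.frame(table(a) - table(b))
tidyr::uncount(counts, Freq)$a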

Result

[1] B C
Levels: A B C
Cajuput answered 23/3, 2023 at 0:39 Comment(0)
