Removing elements from a vector as many times as they occur in another vector in R

Asked 22/3, 2023 at 23:11 Answered 23/3, 2023 at 12:15

I want to remove elements from a vector as many times as they occur in another vector, as if I were subtracting one from the other. Every element of the vector I want to remove is guaranteed to exist in the main vector I want to remove from.

a <- c("A", "B", "B", "C", "C", "C")
b <- c("A", "B", "C", "C")

a[! a %in% b] #returns character(0)

#expected result = "B" "C"

I don't want to use a library for this. I'd rather write a function if possible without loops. Is there a way to do so? Thank you in advance

Vinasse answered 22/3, 2023 at 23:11 Comment(1)

Very possibly a duplicate of "Set Difference" between two vectors with duplicate values – Contessacontest 23/3, 2023 at 0:19

This may not be the most efficient, but

Reduce(function(prev, this) {
  ind <- match(this, prev)
  # match() returns NA (not a zero-length result) when `this` is absent from `prev`
  if (is.na(ind)) prev else prev[-ind]
}, b, init = a)
# [1] "B" "C"

For fun, here's a non-Reduce variant (motivated by looking at AllanCameron's simpler answer) that preserves order. The added complexity is only worth it if preserving order is necessary.

finddiff2 <- function(A, B) {
  dict <- split(seq_along(A), A)
  tb <- table(B)
  nms <- intersect(names(tb), A)
  dict[nms] <- Map(tail, dict[nms], -tb[nms])
  A[sort(unlist(dict))]
}
finddiff2(a, b)
# [1] "B" "C"
finddiff2(rev(a), b)
# [1] "C" "B"
finddiff2(c("A","B"), "A")
# [1] "B"

The preservation is easier to see with a longer a:

a <- rep(c("A","B","C"), times = 4)
finddiff2(a, b)
# [1] "A" "B" "A" "B" "C" "A" "B" "C"
finddiff2(rev(a), b)
# [1] "B" "A" "C" "B" "A" "C" "B" "A"
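The two building blocks of finddiff2 can be looked at in isolation. This is only an illustrative sketch; the vector name x is made up for the example and is not part of the answer.

```r
x <- c("A", "B", "B", "C", "C", "C")

# Positions of each distinct value, listed in order of appearance.
split(seq_along(x), x)
# $A
# [1] 1
#
# $B
# [1] 2 3
#
# $C
# [1] 4 5 6

# A negative n makes tail() drop the first |n| entries
# instead of keeping the last n.
tail(4:6, -2)
# [1] 6
tail(2:3, -1)
# [1] 3
```

Sorting the surviving positions (3 and 6 here) and indexing x with them is what restores the original order.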

Lawerencelawes answered 22/3, 2023 at 23:26 Comment(5)

Ah, I get what you mean about the preserved ordering now. Thanks for the addition. – Lour 22/3, 2023 at 23:44

Try: finddiff2(c("A", "B"), "A") – Poff 23/3, 2023 at 13:33

Good find, fixed @Poff – Lawerencelawes 23/3, 2023 at 13:36

Maybe A[sort(unlist(dict))] instead of rep(names(dict), lengths(dict))[order(unlist(dict))]? That also keeps the type (e.g. it will work with integer and will not convert to character). – Poff 23/3, 2023 at 13:44

@GKi, that's a great point and recommendation, thanks. – Lawerencelawes 23/3, 2023 at 15:23

In base R you could use pmatch:

a[-pmatch(b, a, 0)]
[1] "B" "C"

Note that the 0 (the nomatch argument) is needed in case b contains a value that does not exist in a.

If all the elements in b are in a, then the following is sufficient:

a[-pmatch(b, a)]
[1] "B" "C"

NB

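As @jblood pointed out, pmatch only handles the duplicates correctly when at least one of the two vectors has length 100 or less, i.e. length(a) < 101 | length(b) < 101. When both are longer, the result is incorrect.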
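The length caveat can be checked directly, and the idiom can be wrapped so it refuses to run outside the range reported to be safe. This is a minimal sketch: remove_with_pmatch is a hypothetical helper name, and the 100-element limit is taken from the comments on this answer rather than from documented behaviour.

```r
x100 <- rep("a", 100)
x101 <- rep("a", 101)

# With 100 elements every duplicate is paired with its own position.
identical(pmatch(x100, x100), seq_len(100))

# With 101 elements in both vectors, the comments report that this
# no longer holds; run it to see what your R version returns.
identical(pmatch(x101, x101), seq_len(101))

# Hypothetical guard around the pmatch idiom.
remove_with_pmatch <- function(a, b) {
  if (length(a) > 100 && length(b) > 100) {
    stop("pmatch-based removal is unreliable when both vectors exceed 100 elements")
  }
  idx <- pmatch(b, a, nomatch = 0)
  idx <- idx[idx > 0]
  # a[-integer(0)] would return an empty vector, so handle "nothing to drop".
  if (length(idx)) a[-idx] else a
}

remove_with_pmatch(c("A", "B", "B", "C", "C", "C"), c("A", "B", "C", "C"))
# [1] "B" "C"
```

pmatch can also match partial strings, so this remains suitable only when every element of b is known to occur in a.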

Viscose answered 23/3, 2023 at 1:3 Comment(4)

This should come with a strong caveat of length(a) < 101 | length(b) < 101. Otherwise the result will be incorrect. Compare pmatch(rep("a", 100), rep("a", 100)) to pmatch(rep("a", 101), rep("a", 101)). – Thickknee 23/3, 2023 at 14:51

@Thickknee so far I do not know as to why that is the case, though the behaviour is quite striking. The only thing i have found so far is the notion that the target is not allowed to be long – Viscose 23/3, 2023 at 15:6

Line 1603: github.com/wch/r-source/blob/… – Thickknee 23/3, 2023 at 15:7

@Thickknee Thats right. I missed that. haha. yes yes, it shows that both need to be less than 100 otherwise pmatch would just change to charmatch and not allow multiple exact matches – Viscose 23/3, 2023 at 15:10

If you want to define a simple function, you could do:

finddiff <- function(a, b) {
  levs <- unique(c(a, b))
  tab  <- table(factor(a, levs)) - table(factor(b, levs))
  tab  <- abs(tab[tab != 0])
  rep(names(tab), tab)
}

finddiff(a, b)
#> [1] "B" "C"
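When b can contain values that are missing from a, finddiff as written also returns those values, because abs() turns negative counts into positive ones. A sketch of a variant that keeps only the surplus in a (the name finddiff_pos is made up for this example):

```r
finddiff_pos <- function(a, b) {
  levs <- unique(c(a, b))
  tab  <- table(factor(a, levs)) - table(factor(b, levs))
  # Keep only values that occur more often in a than in b.
  tab  <- tab[tab > 0]
  rep(names(tab), tab)
}

# "C" occurs only in the vector of elements to remove.
finddiff_pos(c("A", "B", "B"), c("A", "C"))
# [1] "B" "B"

# The abs(tab[tab != 0]) version would return "B" "B" "C" for the same input.
```

For the question as asked, where every element of b exists in a, both versions give the same result. Like the original, this variant returns a character vector regardless of the input type.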

Lour answered 22/3, 2023 at 23:31 Comment(6)

I thought about subtracting tables like that, nice approach. The only advantage Reduce has is that it preserves order, which I feared (without verification) that table(.) - table(.) would not. Nice use of factor in there, btw. – Lawerencelawes 22/3, 2023 at 23:32

Thanks @r2evans. The ordering in tables is based on factor levels, so we are guaranteed to get matching tables as long as we specify that a and b are factors with the same levels harvested from the unique values of both vectors. I never thought to use Reduce, but it's a neat idea too. – Lour 22/3, 2023 at 23:42

finddiff doesn't preserve order when the letters are not sorted, but the OP never stated that as a requirement, I assumed it (for the challenge). – Lawerencelawes 22/3, 2023 at 23:44

Maybe tab[tab > 0] instead of abs(tab[tab != 0]) – Poff 23/3, 2023 at 9:54

@Poff I guess it depends on what you are trying to extract. tab[tab > 0] would get all elements of a not in b, i.e. A - B but abs(tab[tab != 0]) gets all elements of a and b that are not part of the intersection, i.e. A∪B - A∩B. Both are the same in this example of course, so the OP's aim was open to interpretation. – Lour 23/3, 2023 at 10:25

I would interpret I want to remove the elements as A - B. But yes with: Given that every element in my vector of elements I want to remove is also existing in the main vector i want to remove from. it will not matter. – Poff 23/3, 2023 at 10:33

Using a data.table anti-join with rowid is fast and preserves order:

library(data.table)
data.table(a, rowid(a))[!data.table(b, rowid(b)), on = .(a = b, V2)][[1]]
#> [1] "B" "C"
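The anti-join works because rowid numbers the repeated occurrences of each value, so the second "B" in a has no partner in b. The same idea can be sketched in base R without any package. The helper names occurrence and antijoin_base are made up for this example, and pasting keys together is a simplification that assumes character (or cleanly printable) values.

```r
# Number each occurrence of a value: 1st "B", 2nd "B", ...
occurrence <- function(x) ave(seq_along(x), x, FUN = seq_along)

antijoin_base <- function(a, b) {
  # The counter goes first so the key stays unambiguous
  # even if a value itself contains the separator.
  key_a <- paste(occurrence(a), a, sep = "\r")
  key_b <- paste(occurrence(b), b, sep = "\r")
  a[!key_a %in% key_b]
}

a <- c("A", "B", "B", "C", "C", "C")
b <- c("A", "B", "C", "C")
antijoin_base(a, b)
# [1] "B" "C"
antijoin_base(rev(a), b)
# [1] "C" "B"
```

This keeps the original order and the original type of a, since it only subsets a.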

A collapse solution that is even faster but does not preserve order:

library(collapse)
rep(names(x <- pmax(fsum(rep(c(1, -1), c(length(a), length(b))), c(a, b)), 0)), x)
#> [1] "B" "C"
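Unpacked into steps, the one-liner computes a net count per distinct value. The following base R sketch mirrors that logic with rowsum in place of fsum; it is meant to show the idea, not to match the speed of collapse, and the names signed, sums and net are illustrative only.

```r
a <- c("A", "B", "B", "C", "C", "C")
b <- c("A", "B", "C", "C")

# +1 for every element of a, -1 for every element of b.
signed <- rep(c(1, -1), c(length(a), length(b)))

# Sum the signs per distinct value (one-column matrix, rows sorted by value).
sums <- rowsum(signed, c(a, b))

# Clamp negative counts (values only in b) to zero and keep the value names.
net <- setNames(pmax(sums[, 1], 0), rownames(sums))

rep(names(net), net)
# [1] "B" "C"
```

Because the values are grouped and sorted, the original order of a is lost, which is the same trade-off as in the collapse version.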

Testing it on larger vectors:

set.seed(2041082007)
a <- stringi::stri_rand_strings(2e5, 2)
b <- sample(a, 1e5)

microbenchmark::microbenchmark(
  data.table = data.table(a, rowid(a))[!data.table(b, rowid(b)), on = .(a = b, V2)][[1]],
  collapse = rep(names(x <- pmax(fsum(rep(c(1, -1), c(length(a), length(b))), c(a, b)), 0)), x)
)
#> Unit: milliseconds
#>        expr       min        lq      mean    median        uq     max neval
#>  data.table 19.130200 20.470851 22.632482 22.279351 23.491751 64.2524   100
#>    collapse  4.622602  5.939251  6.959331  6.490801  7.773651 12.1256   100

Compare to the pmatch solution from this answer:

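# ab1 is the result of the data.table anti-join above
ab1 <- data.table(a, rowid(a))[!data.table(b, rowid(b)), on = .(a = b, V2)][[1]]
system.time(ab2 <- a[-pmatch(b, a, 0)])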
#>    user  system elapsed 
#>   46.53    0.00   46.56

Additionally, pmatch does not seem to behave correctly for this problem:

all.equal(ab1, ab2)
#> [1] "Lengths (100000, 196156) differ (string compare on first 100000)"
#> [2] "99979 string mismatches"

pmatch is returning a much larger vector than expected. Get the difference between the two answers:

ab12 <- data.table(ab2, rowid(ab2))[!data.table(ab1, rowid(ab1)), on = .(ab2 = ab1, V2)][[1]]

Check what is happening with the first element of ab12.

ab12[1]
#> [1] "28"
sum(a == ab12[1])
#> [1] 57
sum(b == ab12[1])
#> [1] 45

"28" appears 57 times in a and 45 times in b, so the result should have 12 instances of "28" as was returned by the anti-join.

sum(ab1 == ab12[1])
#> [1] 12

The pmatch solution, however, erroneously returns a vector that has 56 instances of "28".

sum(ab2 == ab12[1])
#> [1] 56

Thickknee answered 23/3, 2023 at 12:15 Comment(7)

pmatch will also match parts of a string, but this should not be a problem when as given in the question: Every element in my vector of elements I want to remove is also existing in the main vector I want to remove. – Poff 23/3, 2023 at 12:27

At first I was thinking it had to do with partial matching, but it doesn't seem to be the case. It seems to have to do with the vector sizes. Try set.seed(1); a <- sample(LETTERS, 1e3, 1); b <- sample(a, 5e2, 1); ab <- a[-pmatch(b, a, 0)]; sum(a == "A"); sum(b == "A"); sum(ab == "A"). – Thickknee 23/3, 2023 at 12:36

@GKi, on the other hand, it seems to work ok for smaller vectors: set.seed(1); a <- sample(LETTERS, 200, 1); b <- sample(a, 100, 1); ab <- a[-pmatch(b, a, 0)]; sum(a == "A"); sum(b == "A"); sum(ab == "A"). – Thickknee 23/3, 2023 at 12:43

Yes you are right! pmatch(rep("a", 100), rep("a", 100)) works in my case, while pmatch(rep("a", 101), rep("a", 101)) does not. – Poff 23/3, 2023 at 12:52

Yep. 101 seems to be the transition point. I'm trying to find the .Internal code for pmatch, but I'm having a hard time. – Thickknee 23/3, 2023 at 13:6

Maybe because it "comes from" argument matching and typical there are less than 100 arguments...? – Poff 23/3, 2023 at 13:10

See line 1603 here: github.com/wch/r-source/blob/… – Thickknee 23/3, 2023 at 13:48

c <- data.frame(table(a) - table(b))
tidyr::uncount(c, Freq)$a

Result

[1] B C
Levels: A B C

Cajuput answered 23/3, 2023 at 0:39 Comment(0)
