Function to find symmetric difference (opposite of intersection) in R?

5

12

The Problem

I have two string vectors of different lengths, each holding a different set of strings. I want to find the strings that appear in exactly one of the vectors, not in both; that is, the symmetric difference.

Analysis

I looked at the function setdiff, but its output depends on the order in which the vectors are considered. I found the custom function outersect, but this function requires the two vectors to be of the same length.
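
A small sketch of that order dependence, using made-up vectors `x` and `y` (illustrative names, not the actual data from the question):

```r
# Two character vectors of different lengths (made-up example data)
x <- c("a", "b", "c")
y <- c("b", "c", "d", "e")

# setdiff() is one-sided: it keeps the elements of its first argument
# that do not occur in its second argument
only_in_x <- setdiff(x, y)
only_in_y <- setdiff(y, x)

print(only_in_x)  # "a"
print(only_in_y)  # "d" "e"
```

Neither call alone is the symmetric difference; that would be the combination of both one-sided results.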

Any suggestions?

Correction

This issue seems to be specific to the data with which I am working. Otherwise, the answer below addresses the problem I mention in this post. I will look to see what is unique about my data and post back if I learn anything that might be helpful to other users.

Yardstick answered 5/11, 2013 at 20:9 Comment(1)
In addition to the existing answers: There is also sets::set_symdiff(). - Inappreciative
G
23

Why not:

sym_diff <- function(a,b) setdiff(union(a,b), intersect(a,b))
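
A quick check on made-up vectors of different lengths (`x` and `y` are illustrative, not the asker's data) suggests that the lengths do not need to match:

```r
sym_diff <- function(a, b) setdiff(union(a, b), intersect(a, b))

# made-up inputs: 3 elements versus 4 elements
x <- c("a", "b", "c")
y <- c("b", "c", "d", "e")

result <- sym_diff(x, y)
print(result)  # "a" "d" "e"

# swapping the arguments changes only the ordering, not the contents
stopifnot(setequal(sym_diff(x, y), sym_diff(y, x)))
```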
Gloriole answered 5/11, 2013 at 20:13 Comment(7)
Thanks for the suggestion, but this function doesn't work; the output is incorrect. I think it gets tripped up by the fact that the vectors differ in length. - Yardstick
Can you post some example code in your question showing some sample inputs and what you expect to be the output? - Gloriole
@user2932774, this seems to correctly answer the question you posted, and it does not depend on the vectors being the same length, although without sample data and expected output you may have miscommunicated your intent. - Oat
I see what you're saying; when I use sample data, sym_diff works. For some reason, it doesn't work on the data to which I originally wanted to apply this solution. Thanks again for the suggestion. - Yardstick
@user2932774 Right... so can you post the data on which the solution is not working? - Gloriole
I'm new to StackOverflow and I don't know how to do that yet; I'll have to look into it. I'll probably not do so because I've already gotten a lot of down votes and I'm not sure what I'm doing wrong that is upsetting so many people. - Yardstick
@user2932774 In the r tag, the community appreciates a well-researched question and, where there is data, a reproducible example. Otherwise it seems to be a well-formed question. - Gloriole
D
10

Another option that is a bit faster is:

sym_diff2 <- function(a,b) unique(c(setdiff(a,b), setdiff(b,a)))

If we compare it with the answer by Blue Magister:

sym_diff <- function(a,b) setdiff(union(a,b), intersect(a,b))

library(microbenchmark)
library(MASS)

set.seed(1)
cars1 <- sample(Cars93$Make, 70)
cars2 <- sample(Cars93$Make, 70)

microbenchmark(sym_diff(cars1, cars2), sym_diff2(cars1, cars2), times = 10000L)

>Unit: microseconds
>                    expr     min       lq     mean   median      uq      max neval
>  sym_diff(cars1, cars2) 114.719 119.7785 150.7510 125.0410 131.177 12382.02 10000
> sym_diff2(cars1, cars2)  94.369 100.0205 121.6051 103.8285 109.239 12013.69 10000

identical(sym_diff(cars1, cars2), sym_diff2(cars1, cars2))
>[1] TRUE

The speed difference between these two methods increases when the samples compared are larger (thousands of elements or more), but I couldn't find an example dataset with that many values to demonstrate it.
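
On whether the outer `unique()` is required: base R's `setdiff()` already returns deduplicated output, and the two one-sided differences cannot share an element, so a variant without `unique()` should give the same result. A minimal sketch, where `sym_diff2b` is a hypothetical name and the inputs are made up:

```r
sym_diff2  <- function(a, b) unique(c(setdiff(a, b), setdiff(b, a)))
# hypothetical variant without the outer unique()
sym_diff2b <- function(a, b) c(setdiff(a, b), setdiff(b, a))

# made-up inputs with repeated values on both sides
a <- c("x", "x", "y", "z", "z")
b <- c("z", "w", "w")

# setdiff() itself drops the repeated "x"
print(setdiff(a, b))  # "x" "y"

same <- identical(sym_diff2(a, b), sym_diff2b(a, b))
print(same)  # TRUE
```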

Dongola answered 11/3, 2016 at 20:26 Comment(1)
This doesn't need unique, does it? Shouldn't the output of setdiff(a,b) already be distinct from the output of setdiff(b,a)? - Adventitia
B
3

Here is another symmetric difference function, this one taken directly from the definition (which can be seen, for instance, on the Wikipedia page linked to in the question).

sym_diff3 <- function(a, b) union(setdiff(a, b), setdiff(b, a))

Adding this function to the benchmark from the other answer by user sebpardo gives approximately the same timings, only slightly slower. Output omitted.

identical(sym_diff(cars1, cars2), sym_diff3(cars1, cars2))
#[1] TRUE

microbenchmark(sym_diff(cars1, cars2),
               sym_diff2(cars1, cars2), 
               sym_diff3(cars1, cars2),
               times = 10000L)
Betray answered 14/4, 2020 at 11:33 Comment(0)
H
2

You can use symdiff in dplyr since 1.1.0:

library(dplyr)
symdiff(1:3, 3:5)
#[1] 1 2 4 5
Holliehollifield answered 23/10, 2022 at 11:43 Comment(0)
T
0

This is an old question, but if you want a faster function, avoid the set-operation functions such as setdiff or union: each of them calls duplicated or unique internally, so you end up removing duplicates repeatedly. Using match and removing duplicates once at the end appears to be the fastest approach. For character vectors, data.table::chmatch is faster than match.

library(data.table)
x1 <- janeaustenr::austen_books()$text |> sample(3e3)
x2 <- janeaustenr::austen_books()$text |> sample(3e3)
symdiff_dt <- function(x, y) {
  c(
    x[chmatch(x, y, 0L) == 0L],
    y[chmatch(y, x, 0L) == 0L]
  ) |>
    unique()
}
symdiff_match <- function(x, y) {
  c(x[!x %in% y], y[!y %in% x]) |> unique()
}
symdiff_setdiff1 <- function(x, y) {
  c(
    setdiff(x, y),
    setdiff(y, x)
  ) |>
    unique()
}
symdiff_setdiff2 <- function(x, y) {
  setdiff(
    union(x, y),
    intersect(x, y)
  )
}

microbenchmark::microbenchmark(
  symdiff_dt = symdiff_dt(x1, x2),
  symdiff_match = symdiff_match(x1, x2),
  symdiff_setdiff1 = symdiff_setdiff1(x1, x2),
  symdiff_setdiff2 = symdiff_setdiff2(x1, x2),
  check = "equal"
)
#> Unit: microseconds
#>              expr   min     lq     mean median     uq     max neval
#>        symdiff_dt 327.5 386.90  489.628  409.2 462.70  2876.3   100
#>     symdiff_match 405.6 519.25  809.597  555.3 646.95 12428.0   100
#>  symdiff_setdiff1 532.6 662.00  954.322  718.0 817.00 10741.9   100
#>  symdiff_setdiff2 675.2 767.40 1040.671  823.0 946.00 10056.3   100

Created on 2024-01-04 with reprex v2.0.2
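
To see why the match-based versions still need one final `unique()`, consider a tiny made-up input in which one vector repeats a value. This is only a sketch: `symdiff_match` is restated here without the native pipe so that it also runs on R versions before 4.1, and `x`, `y`, and `raw` are illustrative names.

```r
# same logic as symdiff_match above, written without |>
symdiff_match <- function(x, y) {
  unique(c(x[!x %in% y], y[!y %in% x]))
}

# made-up inputs; "a" occurs twice in x
x <- c("a", "a", "b")
y <- c("b", "c")

# %in% filters element by element, so repeated values survive the filter
raw <- c(x[!x %in% y], y[!y %in% x])
print(raw)                  # "a" "a" "c"

# a single unique() at the end removes them
print(symdiff_match(x, y))  # "a" "c"
```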

Tameka answered 4/1 at 16:11 Comment(0)
