How to find intersection between all possible pairs of sets in a 2-column table?

I want to calculate an overlap coefficient between sets. My data comes as a 2-column table, such as:

df_example <- 
  tibble::tribble(~my_group, ~cities,
                   "foo",   "london",
                   "foo",   "paris", 
                   "foo",   "rome", 
                   "foo",   "tokyo",
                   "foo",   "oslo",
                   "bar",   "paris", 
                   "bar",   "nyc",
                   "bar",   "rome", 
                   "bar",   "munich",
                   "bar",   "warsaw",
                   "bar",   "sf", 
                   "baz",   "milano",
                   "baz",   "oslo",
                   "baz",   "sf",  
                   "baz",   "paris")

In df_example, I have 3 sets (i.e., foo, bar, baz), and members of each set are given in cities.

I would like to end up with a table that intersects every possible pair of sets and gives the size of the smaller set in each pair. From those two numbers I can calculate an overlap coefficient for each pair of sets.

(Overlap coefficient = number of common members / size of smaller set)
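As a quick check of the definition, the coefficient for a single pair can be worked out by hand with base R set functions. This is only a sketch for the foo/bar pair from df_example; the variable names are illustrative and not part of the original data.

```r
# Members of two of the sets from df_example
foo <- c("london", "paris", "rome", "tokyo", "oslo")
bar <- c("paris", "nyc", "rome", "munich", "warsaw", "sf")

common <- intersect(foo, bar)              # cities present in both sets
smaller <- min(length(foo), length(bar))   # size of the smaller set
overlap_coeff <- length(common) / smaller  # 2 / 5

print(overlap_coeff)
```

Here the two sets share "paris" and "rome", the smaller set has 5 members, and the coefficient is 0.4.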

Desired Output

## # A tibble: 3 × 4
##   combination n_intersected_members size_of_smaller_set overlap_coeff
##   <chr>                       <dbl>               <dbl>         <dbl>
## 1 foo*bar                         2                   5           0.4
## 2 foo*baz                         2                   4           0.5
## 3 bar*baz                         2                   4           0.5

Is there a simple enough way to get this done with dplyr verbs? I've tried

df_example |> 
  group_by(my_group) |> 
  summarise(intersected = dplyr::intersect(cities))

But this won't work, obviously, because dplyr::intersect() expects two vectors. Is there a way to get to the desired output similar to my dplyr direction?

Legation answered 2/10, 2023 at 12:5

Here is a base R option using combn

do.call(
    rbind,
    combn(
        with(
            df_example,
            split(cities, my_group)
        ),
        2,
        \(x)
        transform(
            data.frame(
                combo = paste0(names(x), collapse = "-"),
                nrIntersect = sum(x[[1]] %in% x[[2]]),
                szSmallSet = min(lengths(x))
            ),
            olCoeff = nrIntersect / szSmallSet
        ),
        simplify = FALSE
    )
)

which gives

    combo nrIntersect szSmallSet olCoeff
1 bar-baz           2          4     0.5
2 bar-foo           2          5     0.4
3 baz-foo           2          4     0.5
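
Since the question asks for dplyr verbs, the same pairwise computation can also be sketched as a self-join on cities. This is a rough sketch rather than a definitive solution: it assumes dplyr >= 1.1.0 (for the relationship argument), the intermediate names (set_sizes, my_group_a, my_group_b, size_a, size_b) are made up for illustration, and pairs of sets with no common member would be dropped from the result rather than reported with a count of 0.

```r
library(dplyr)

df_example <- tibble::tibble(
  my_group = rep(c("foo", "bar", "baz"), times = c(5, 6, 4)),
  cities = c("london", "paris", "rome", "tokyo", "oslo",
             "paris", "nyc", "rome", "munich", "warsaw", "sf",
             "milano", "oslo", "sf", "paris")
)

# One row per (set, member), and the size of each set
members <- distinct(df_example, my_group, cities)
set_sizes <- count(members, my_group, name = "size")

overlap <- members |>
  # Pair up every two sets that share a city
  inner_join(members, by = "cities", suffix = c("_a", "_b"),
             relationship = "many-to-many") |>
  # Keep each unordered pair once and drop self-pairs
  filter(my_group_a < my_group_b) |>
  count(my_group_a, my_group_b, name = "n_intersected_members") |>
  # Attach both set sizes; the second join renames them size_a / size_b
  left_join(set_sizes, by = c("my_group_a" = "my_group")) |>
  left_join(set_sizes, by = c("my_group_b" = "my_group"),
            suffix = c("_a", "_b")) |>
  mutate(
    combination = paste(my_group_a, my_group_b, sep = "*"),
    size_of_smaller_set = pmin(size_a, size_b),
    overlap_coeff = n_intersected_members / size_of_smaller_set
  ) |>
  select(combination, n_intersected_members, size_of_smaller_set, overlap_coeff)

print(overlap)
```

The self-join keeps everything in data-frame form, at the cost of materialising one row per shared city per pair of sets, so for large inputs the combn approach above may be the lighter option.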
Shattuck answered 2/10, 2023 at 12:20

Another way to organize the data would be to use a tabular form (we can use a sparse Matrix to save memory if needed):

library(Matrix)  # sparse-matrix methods for crossprod() and colSums() used below
tab = xtabs( ~ cities + my_group, df_example, sparse = TRUE)

Then, all other variables can be calculated as:

n_intersected_members = crossprod(tab)
size_of_smaller_set = outer(cs <- colSums(tab), cs, pmin)
overlap_coeff = n_intersected_members / size_of_smaller_set
#overlap_coeff
#3 x 3 Matrix of class "dsyMatrix"
#    bar baz foo
#bar 1.0 0.5 0.4
#baz 0.5 1.0 0.5
#foo 0.4 0.5 1.0 

If needed, extract the lower triangle of each matrix (via lower.tri()) to get one value per pair.
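
One possible way to turn the lower triangles into a long table like the one requested in the question is sketched below. It assumes the Matrix package is available, converts the cross-product to a dense matrix for simple indexing, and uses illustrative names (n_common, smaller, idx, pairs) that are not part of the answer above.

```r
library(Matrix)

df_example <- data.frame(
  my_group = rep(c("foo", "bar", "baz"), times = c(5, 6, 4)),
  cities = c("london", "paris", "rome", "tokyo", "oslo",
             "paris", "nyc", "rome", "munich", "warsaw", "sf",
             "milano", "oslo", "sf", "paris")
)

tab <- xtabs(~ cities + my_group, df_example, sparse = TRUE)
n_common <- as.matrix(crossprod(tab))  # dense copy of the pairwise counts
cs <- colSums(tab)                     # size of each set
smaller <- outer(cs, cs, pmin)         # pairwise size of the smaller set

# Row/column positions of the cells below the diagonal: one per pair
idx <- which(lower.tri(n_common), arr.ind = TRUE)

pairs <- data.frame(
  combination = paste(colnames(n_common)[idx[, "col"]],
                      rownames(n_common)[idx[, "row"]], sep = "*"),
  n_intersected_members = n_common[idx],
  size_of_smaller_set = smaller[idx]
)
pairs$overlap_coeff <- pairs$n_intersected_members / pairs$size_of_smaller_set

print(pairs)
```

Converting to a dense matrix gives up the memory benefit of the sparse representation, so this step is only reasonable when the number of sets is small.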

Adda answered 3/10, 2023 at 7:19

Comments (2):
Nice approach, +1! BTW, table(df_example) should be more concise and efficient. - Shattuck
@Shattuck: Agreed, table is straightforward. I deliberately went with xtabs so the result is stored as a sparse matrix. - Adda
