Pairwise Set Intersection in Python

If I have a variable number of sets (let's call the number n), which have at most m elements each, what's the most efficient way to calculate the pairwise intersections for all pairs of sets? Note that this is different from the intersection of all n sets.

For example, if I have the following sets:

A={"a","b","c"}
B={"c","d","e"}
C={"a","c","e"}

I want to be able to find:

intersect_AB={"c"}
intersect_BC={"c", "e"}
intersect_AC={"a", "c"}

Another acceptable format (if it makes things easier) would be a map from each item in a given set to the sets that contain that item. For example:

intersections_C={"a": {"A", "C"},
                 "c": {"A", "B", "C"},
                 "e": {"B", "C"}}

I know that one way to do so would be to create a dictionary mapping each value in the union of all n sets to a list of the sets in which it occurs, and then iterate through those values to build maps such as intersections_C above, but I'm not sure how that scales as n increases and the sets themselves become very large.
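
For concreteness, a rough sketch of that dictionary approach might look like the following (the set labels and helper name are just illustrative):

from collections import defaultdict

# Map each element to the labels of the input sets that contain it.
def build_membership(named_sets):
    membership = defaultdict(set)
    for label, s in named_sets.items():
        for item in s:
            membership[item].add(label)
    return membership

named_sets = {"A": {"a", "b", "c"}, "B": {"c", "d", "e"}, "C": {"a", "c", "e"}}
membership = build_membership(named_sets)
# membership["c"] == {"A", "B", "C"}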

Some additional background information:

  1. The sets are all of roughly the same size, but are very large (large enough that storing them all in memory is a realistic concern; an algorithm that avoids doing so would be preferred, though it is not necessary)
  2. The size of the intersections between any two sets is very small compared to the size of the sets themselves
  3. If it helps, we can assume anything we need to about the ordering of the input sets.
Amick asked 9/12, 2014 at 0:13 Comment(13)
Have you tried the method that you know works?Stambaugh
I would suggest the following: Traverse all sets and build a map by tracking where you find each element. This is O(N log N) (assuming that the dictionary adds a logarithmic overhead), where N is the total number of elements.Felting
I've tried the method I described on small samples, but the problem is that a lot of the data I will be using with this is user-fed. I would ideally like to be able to support much larger use cases, so I was wondering if there's a more common/efficient way to do this than the naive approach I described.Amick
@Felting Is your idea to traverse the sets and make a dictionary independently for all n sets, making the dictionary only of size m per iteration rather than n*m to store all possible elements?Amick
I think this can be done in linear time using a hash table, linear with respect to the size of the sets: O(N + M + N * c), where c is a constant that represents the cost of accessing an entry in the hash table; this constant will be proportional to the length of the strings in your sets.Udall
It can be done in constant time (assuming set subtraction is constant time)Epiphyte
I suggested to use a single dictionary for all sets, which would finally contain all elements and would therefore be of size O(n*m) if you have n sets of roughly the same size m. But you'll have to provide more information, if you want a more educated guess. For example, is it just the n*(n-1)/2 intersections that you're interested in? Can you afford an extra O(n*m) space? If yes, you can calculate them all in O(n*m*log(n*m)) time --- assuming that your intersections are small; worst case O(n*n*m*log(n*m)).Felting
@nickie: Is there a difference between your suggestion and what I mentioned as a possible strategy below intersections_C in my post? The only information that I care about at the end are the intersections of the sets. Unfortunately, I don't know too much about what my constraints are because I haven't been able to test this to scale just yet. Thanks for your help!Amick
I assumed that your intersections_C dictionary contains only the elements that belong in set C. I suggested one dictionary for the elements contained in the union of all your sets.Felting
Can we assume anything about the ordering of the set elements themselves?Thirst
@nickie: Ah ok, yeah I should have clarified my approach better -- I meant to create a dictionary for the union of all the sets, iterate through each set to populate the dictionary with which sets contain each element, and then iterate through the dictionary once more to create dictionaries such as intersections_C.Amick
@tzaman: If it helps, we can assume that the keys are sortedAmick
1. Approximately how large is n in practice? 2. If you were to take all of your sets and partition them randomly into two groups A and B, could (⋃A) ⋂ (⋃B) be large? In Python syntax, that's set().union(*A) & set().union(*B).Nevillenevin

this ought to do what you want

import random as RND
import string
import itertools as IT

mock some data

fnx = lambda: set(RND.sample(string.ascii_uppercase, 7))
S = [fnx() for _ in range(5)]

generate an index list of the sets in S so the sets can be referenced more concisely below

idx = range(len(S))

get all possible unique pairs of the items in S; however, since set intersection is commutative, we want the combinations rather than permutations

pairs = IT.combinations(idx, 2)

write a function to perform the set intersection

nt = lambda a, b: S[a].intersection(S[b])

fold this function over the pairs & key the result from each function call to its arguments

res = dict([ (t, nt(*t)) for t in pairs ])

the result, formatted per the first option given in the OP, is a dictionary in which the values are the set intersections of two sequences; each value is keyed to a tuple comprising the two indices of those sequences

this solution is really just two lines of code: (i) calculate the combinations of index pairs; (ii) apply a function over each pair, storing the returned values in a key-value container

the memory footprint of this solution is minimal, but you can do even better by returning a generator expression in the last step, i.e.,

res = ( (t, nt(*t)) for t in pairs )

notice that with this approach, neither the sequence of pairs nor the corresponding intersections is ever written out in memory; i.e., both pairs and res are iterators.
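
for instance, one way to consume the generator version without ever materializing the full result, keeping only the non-empty intersections, would be:

for (i, j), common in res:
    if common:            # skip pairs whose intersection is empty
        print(i, j, common)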

Dalenedalenna answered 9/12, 2014 at 1:10 Comment(7)
This takes O(n*n*m) time, if you have n sets of size m.Felting
the time complexity for intersection of two sets x and y is O(len(x) * len(y)); unfavorable time complexity is inherent in this problem, so the best you can do is not make it worse and just address the constant factors (e.g., don't re-implement the underlying C loops; instead use the optimized Python constructs, like list comprehensions)Dalenedalenna
This definitely gets the job done, but it would be super cool if there was a way to use this simple syntax with a much more memory-efficient approach like tzaman described. Thanks!Amick
have you looked at the memory profile of this solution? nothing is written out in memory here. the pairs object is an iterator (python3) holding just the set indices, not the sets themselves; likewise, see my amended answer: the last line can be changed slightly to return a generator expression, yielding tuples rather than building a dictDalenedalenna
Ah nice! One more thing that I noticed is that we could cut the number of intersection calculations in half by using combinations rather than permutations of the setsAmick
Why aren't you using a dictionary comprehension (i.e., res = {t: nt(*t) for t in pairs})?Nevillenevin
@Dalenedalenna The time complexity for the intersection of two sets is O(min(len x, len y)), not O(len x · len y). Further, one should be able to reduce that cost significantly if the intersections are known to be small and intermediates can be re-used.Nevillenevin

If we can assume that the input sets are ordered, a pseudo-mergesort approach seems promising. Treating each set as a sorted stream, advance the streams in parallel, always only advancing those where the value is the lowest among all current iterators. Compare each current value with the new minimum every time an iterator is advanced, and dump the matches into your same-item collections.
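
A rough sketch of this idea, assuming each input is available as a sorted stream (heapq.merge performs the parallel advancing described above; the helper name is illustrative):

import heapq
import itertools

def shared_items(sorted_streams):
    # Tag each value with the index of the stream it came from, then
    # k-way merge; heapq.merge keeps only one item per stream in memory.
    tagged = (((value, i) for value in stream)
              for i, stream in enumerate(sorted_streams))
    merged = heapq.merge(*tagged)
    # Group equal values together; values seen in 2+ streams are shared.
    for value, group in itertools.groupby(merged, key=lambda pair: pair[0]):
        owners = [i for _, i in group]
        if len(owners) > 1:
            yield value, owners

streams = [sorted({"a", "b", "c"}), sorted({"c", "d", "e"}), sorted({"a", "c", "e"})]
for value, owners in shared_items(streams):
    print(value, owners)   # a [0, 2] / c [0, 1, 2] / e [1, 2]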

Thirst answered 9/12, 2014 at 1:27 Comment(2)
This was the next thing I was considering -- the dictionary idea (where the dictionary is keyed on the union of all n sets) seemed easier to conceptualize and implement, but I felt that intuitively this would save some time and memory consumption. Any idea how to quantify the savings that this approach has over the dictionary method?Amick
The primary advantage of a streaming approach is that you only need to have one item from each set in memory at a time. The dictionary approach uses far more than that: ~O(number of unique elements * average membership). You can even write out the intersections themselves to a set of files if necessary.Thirst

How about using the intersection method of set? See below:

A={"a","b","c"}
B={"c","d","e"}
C={"a","c","e"}

intersect_AB = A.intersection(B)
intersect_BC = B.intersection(C)
intersect_AC = A.intersection(C)

print(intersect_AB, intersect_BC, intersect_AC)
Plaintive answered 9/12, 2014 at 5:11 Comment(1)
The example that I gave was intended to be a generalized example (I'll have many more than just sets A, B, and C), and I wanted to avoid redoing work if possible because the size of my sets can potentially be enormous.Amick