Algorithm / Data structure for largest set intersection in a collection of sets with a given set

I have a large collection of several million sets, C. The elements of my sets come from a universe of about 2000 possible elements. I need to know, for a given set, s, which set in C has the largest intersection with s? (Or the k sets in C with the k-largest intersections). I will be making many of these queries, sequentially, for different s.

I know that the obvious way to do this is just to loop over every set in C, compute the intersection, and take the max. Are there any smart data structures / programming tricks that can speed up my search? It would be great if I could do this faster than O(C).

EDIT: approximate answers would be alright too
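
For reference, the brute force I'm describing is roughly this (a minimal C# sketch, assuming the ~2000 possible elements have been mapped to small ints; all names are placeholders):

using System.Collections.Generic;
using System.Linq;

static HashSet<int> LargestIntersection(List<HashSet<int>> C, HashSet<int> s)
{
    HashSet<int> best = null;
    int bestCount = -1;
    foreach (var c in C)                         // O(|C|) intersection tests per query
    {
        int count = c.Count(e => s.Contains(e)); // |c ∩ s| without building the intersection
        if (count > bestCount) { bestCount = count; best = c; }
    }
    return best;
}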

Positivism answered 30/7, 2015 at 23:37 Comment(10)
If you want to do this to compute the most "similar" set, MinHash is surely the way to go.Molli
What are the maximum and average sizes of these sets? Is there an upper bound on the number of sets that an element can belong to? Is each element in each set with (close to) some fixed probability, or are some elements in many sets and some in very few?Gaelan
Max size is 2000. An element can belong to any number of sets. The probability of each element is definitely not uniform.Positivism
OK, but are the sets mostly size 3, or mostly size 1000?Gaelan
Average size would probably hover around 1000Positivism
Is every element inside the sets unique or can they repeat? Do you count repeating elements as more intersections or as a single intersection?Crepuscule
I think the answers given so far are as good as it gets. Further options might exist, though, maybe something stochastic. Does the database change over time, or could you derive some fixed helper structures (once!) based on its contents?Photocomposition
How about bloom filters? "Bloom filters can be used to approximate the size of the intersection and union of two sets" en.wikipedia.org/wiki/…Stringboard
@Crepuscule the elements are unique - no concept of elements repeating within the same setPositivism
@Photocomposition helper structures once are finePositivism

I don't think there's a clever data structure that will help with asymptotic performance, but this is a perfect map-reduce problem. A GPGPU would do nicely. For a universe of 2048 elements, a set stored as a bitmap is only 256 bytes, so 4 million sets is only a gigabyte. Even a modestly spec'ed Nvidia card has that much RAM. E.g., programming in CUDA, you'd copy C to graphics card RAM, map a chunk of the gigabyte to each GPU core for searching, and then reduce across cores to find the final answer. This ought to take on the order of a few milliseconds. Not fast enough? Just buy hotter hardware.
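
Even without a GPU, the bitmap representation is worth having. A minimal C# sketch of the per-set test (my own illustration, assuming elements are mapped to bit positions 0..2047 and .NET Core 3.0+ for BitOperations.PopCount):

using System.Numerics; // BitOperations

// Each set is 2048 bits = 32 ulongs (256 bytes); |a ∩ b| is an AND plus a popcount per word.
static int IntersectionSize(ulong[] a, ulong[] b)
{
    int count = 0;
    for (int w = 0; w < a.Length; w++)
        count += BitOperations.PopCount(a[w] & b[w]);
    return count;
}

Scanning a few million such bitmaps (e.g. with Parallel.For) is the CPU analogue of the map/reduce described above.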

If you re-phrase your question along these lines, you'll probably get answers from experts in this kind of programming, which I'm not.

Blackman answered 31/7, 2015 at 3:17 Comment(0)

One simple trick is to sort the list of sets C in decreasing order by size, then proceed with brute force intersection tests as usual. As you go along, keep track of the set b with the biggest intersection so far. If you find a set whose intersection with the query set s has size |s| (or equivalently, has intersection equal to s -- use whichever of these tests is faster), you can immediately stop and return it as this is the best possible answer. Otherwise, if the next set from C has fewer than |b| elements, you can immediately stop and return b. This can easily be generalised to finding the top k matches.
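
A rough C# sketch of that scan (my own illustration; sortedC is assumed to be C sorted once, offline, by decreasing set size):

using System.Collections.Generic;
using System.Linq;

static HashSet<int> BestMatch(List<HashSet<int>> sortedC, HashSet<int> s)
{
    HashSet<int> best = null;
    int bestCount = -1;
    foreach (var c in sortedC)
    {
        if (c.Count < bestCount) break;          // every remaining set is smaller: b cannot be beaten
        int count = c.Count(e => s.Contains(e));
        if (count > bestCount) { bestCount = count; best = c; }
        if (bestCount == s.Count) break;         // intersection equals s: best possible answer
    }
    return best;
}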

Gaelan answered 31/7, 2015 at 5:9 Comment(0)

I don't see any way to do this in less than O(C) per query, but I have some ideas on how to maximize efficiency. The idea is basically to build a lookup table for each element. If some elements are rare and some are common, you can have positive and negative lookup tables:

s[i]    // your query: a boolean array over the ~2000-element universe
sign[i] // +1 if the ith element uses a positive lookup, -1 if it uses a negative one
sets[i] // the sets the ith element belongs to (sign +1) or does not belong to (sign -1)

query(s):
  overlaps = [0] * |C|        // one counter per set in C
  for i in range(len(s)):
    if s[i]:
      for j in sets[i]:
        overlaps[j] += sign[i]

  // negative lookups shift every set's score by the same constant per element,
  // so the scores are not exact intersection sizes, but the argmax is unchanged
  return max_index(overlaps)

Especially if many of your elements are of widely differing probabilities (as you said), this approach should save you some time: very rare or very common elements can be dealt with almost instantly.

To further optimize: you can sort the structure so that the most common and rarest elements are dealt with first. After you have processed, say, the first 3/4 of the elements, you can do a quick pass to see whether the closest matching set is so far ahead of the runner-up that it is not necessary to continue, though again whether that is worthwhile depends on the details of your data's distribution.

Yet another refinement: make sets[i] one of two possible structures. If the element is very rare or very common, sets[i] is just a list of the sets that the ith element is in (or not in). However, suppose the ith element is in half the sets. Then sets[i] is a list of indices half as long as the number of sets, and looping through it to increment overlaps is wasteful. So add a third value for sign[i]: if sign[i] == 0, the ith element is relatively close to 50% commonality (this may just mean between 5% and 95%, or anything else), and instead of a list of the sets it appears in, store an array of 1's and 0's with length equal to |C|. Then you just add that whole array into overlaps, which is faster.
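
A self-contained C# sketch of that mixed sparse/dense accumulation (all names are illustrative, not from the answer above):

// s: the query as a bool array over the ~2000-element universe
// sign[i]: +1 (rare element: sparse list of containing sets),
//          -1 (very common element: sparse list of NON-containing sets),
//           0 (mid-frequency element: dense 0/1 row over all sets)
static int Query(bool[] s, sbyte[] sign, int[][] sparseSets, byte[][] denseBits, int numSets)
{
    var overlaps = new int[numSets];
    for (int i = 0; i < s.Length; i++)
    {
        if (!s[i]) continue;
        if (sign[i] == 0)
            for (int j = 0; j < numSets; j++)
                overlaps[j] += denseBits[i][j];      // branch-free, cache-friendly pass
        else
            foreach (int j in sparseSets[i])
                overlaps[j] += sign[i];
    }
    int best = 0;                                    // argmax; scores are shifted but ranking is preserved
    for (int j = 1; j < numSets; j++)
        if (overlaps[j] > overlaps[best]) best = j;
    return best;
}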

Crotchet answered 31/7, 2015 at 1:11 Comment(0)

Put all of the elements from the million sets into a Hashtable. The key will be the element; the value will be the set of indices of the sets that contain it.

using System.Collections;          // Hashtable
using System.Collections.Generic;  // HashSet, Dictionary, List
using System.Linq;                 // OrderByDescending, Take, Select

HashSet<Element>[] AllSets = ...

// preprocess: build an inverted index from element -> indices of the sets containing it
Hashtable AllElements = new Hashtable(2000);
for(var index = 0; index < AllSets.Length; index++) {
    foreach(var elm in AllSets[index]) {
        if(!AllElements.ContainsKey(elm)) { 
            AllElements.Add(elm, new HashSet<int>() { index });
        } else {
            ((HashSet<int>)AllElements[elm]).Add(index);
        }
    }
}

public List<HashSet<Element>> TopIntersect(HashSet<Element> set, int top = 1) {
    // <index, count>
    Dictionary<int, int> counts = new Dictionary<int, int>();
    foreach(var elm in set) {
        var setIndices = AllElements[elm] as HashSet<int>; // null if elm was never seen
        if(setIndices != null) {
           foreach(var index in setIndices) {
               if(!counts.ContainsKey(index)) {
                   counts.Add(index, 1);
               } else {
                   counts[index]++;
               }
           } 
        }
    }
    return counts.OrderByDescending(kv => kv.Value)
        .Take(top)
        .Select(kv => AllSets[kv.Key]).ToList();
}
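
A call site would then look something like this (querySet is assumed to be a HashSet<Element> built for the query):

var topMatches = TopIntersect(querySet, 5); // the 5 sets in AllSets with the largest overlap with querySet
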
Torytoryism answered 31/7, 2015 at 12:58 Comment(0)
