Fastest Set Operations In The West
Asked Answered
F

6

19

I haven't been able to find any satisfactory coverage of this topic all in one place, so I was wondering: What are the fastest set intersect, union, and disjoin algorithms?
Are there any interesting ones with limited domains?
Can anyone beat O(Z) where Z is the actual size of intersection?

If your approach relies on sorted sets, please note that, but don't consider it a disqualifying factor. It seems to me that there must be a veritable storehouse of subtle optimizations to be shared, and I don't want to miss any of them.

A few algorithms I know rely on bitwise operations beyond the vanilla, so you may assume the presence of SSE4 and access to intrinsics like popcount. Please note this assumption.

Of interest: An Implementation of B-Y Intersect

Update
We've got some really good partial answers, but I'm still hoping for some more complete attacks on the problem. I'm particularly interested in seeing a more fully articulated use of bloom filters in attacking the problem.

Update
I've done some preliminary work on combining bloom filters with a cuckoo hash table. It's looking almost obnoxiously promising, because they have very similar demands. I've gone ahead and accepted an answer, but I'm not really satisfied at the moment.

Faison answered 23/11, 2010 at 22:27 Comment(3)
It depends somewhat on the size of the sets.Ytterbite
Naturally! I did mention domain limitations. I'd still love to see your fast small-set intersect. I guess a good operating presumption is that the smaller of the two sets must have size > 1 for the algorithm to be of interest.Faison
Yeah ... at least > 1 :PSubaquatic
U
4

If you're willing to consider set-like structures then bloom filters have trivial union and intersect operations.

Uintathere answered 23/11, 2010 at 23:43 Comment(8)
Tell me more? I've read the wiki-article, and I'm interested but not convinced that bloom filters have truly trivial intersect operations, and many of the referenced papers are behind pay-walls.Faison
Bloom filters does not change the Big O of the operation, they just move the operations to the hardware level; and if the size of the bloom filter is larger than the hardware's word size, both union and intersect will revert to O(n).Jitter
@Lie Not always. I can write a function that "unions" two bloom filters by simply altering the member function to be something like memberOf a x || memberOf b x. Thus the actual "union" operation was cheap but the member function becomes a layering (same with intersect). This is great if you know you'll only union a limited number of times per filter.Uintathere
Can't think of a lot of cases where you wouldn't know that, actually... This is really clever, is there a library implementation of bloom filters anywhere?Faison
@TomMD: as I already said in @Edmund's comment thread, redefining memberOf does not construct a Set, so it will only have the same semantic as a true Set if a and b is immutable. And you don't need a bloom filter to redefine memberOf as memberOf a x || memberOf b x, so I fail to relate your bloom filter answer with your memberOf redefinition answer.Jitter
@Lie First take note I'm not disagreeing with your last statement, redefining memberOf isn't limited to bloomfilters but it was applicable to your comment. WRT immutability, understand that I think of immutable as the default not the exception (Haskeller). Perhaps a better way to respond would have been noting the O(n) cost of union on bloom filters refers to a different n than the O(n) cost of most sets (e.g. size of the bloom filter instead of number of members) and because the constant size is likely much smaller you'd net real performance wins with memory accesses / cache misses.Uintathere
@Jake Sure. I've not had an issue with BOS's bloomfilter library.Uintathere
Blooms have to be considered in a certain context but where one is talking about LARGE data sets (pushing memory limits) and the domain can live with Bloom limitations (false-positive rate, contains(), union() and add() methods from a Set interface perspective), they work well for certain things like caches to avoid database look-ups or other high I/O operations. Otherwise, if memory is not a limiting factor, Bloom limitations are generally outweighed by the functionality and relative speed of other Set implementations.Deceitful
G
3

For reasonably dense sets, interval lists can beat O(n) for the operations you specify, where n is the number of elements in the set.

An interval list is essentially a strictly increasing list of numbers, [a1, b1, a2, b2, ..., an, bn], where each pair ai, bi denotes the interval [ai, bi). The strictly increasing constraint ensures that every describable set has a unique representation. Representing a set as an ordered collection of intervals allows your set operations to deal with multiple consecutive elements on each iteration.

Gorgonian answered 24/11, 2010 at 4:58 Comment(4)
Oh man! That's a really slick trick. I hadn't even thought to use interval lists. I may be able to optimize this even further.Faison
I'm not sure this is actually better than O(n). It's still O(m+n) where m,n is the number of intervals in set A,B respectively. Though this do allow a whole interval to be compared, which is an speed improvement; but I doubt that it actually changes the complexity class.Jitter
It isn't better than O(n) (i.e., worst case), but its expected behaviour is often much better when your sets contain continguous elements, hence the caveat about reasonably dense sets.Gorgonian
It does require that the total ordering of your elements permit a conception of immediately adjacent elements. For example, it'd work with ints but not arbitrary precision reals. Interesting thing, I think.Faison
A
2

If set is actually a hashed set and both sets have the same hash function and table size then we can skip all buckets that exist only in one set. That could narrow search a bit.

Achieve answered 23/11, 2010 at 22:52 Comment(5)
If both our sets are in p-tries, this becomes extremely interesting and very efficient. I'll give you a hint, it relies on a radix of 5 bits or 6 bits, and the presence of the popcount instruction. :) Of course, p-tries make certain demands on the data in them, but I think you could get the intersect\union\disjoint operations even if your keys had no natural lexical ordering. They'd still need to be able to generate some kind of bitstring for a key though.Faison
Hmm, I just noticed that Z is the size of intersection. My kung fu is not strong enough to beat that.Achieve
Well, I'd still like to see whatcha got. :)Faison
this is a similar idea as bloom filterJitter
Similar, but a slightly different approach. The thing is that it's still all about sort-once-act-often, and relies on a specialized data structure on top of that. But it can get you pretty close to O(Z) if you have popcount.Faison
A
2

The following paper presents algorithms for union, intersection and difference on ordered sets that beat O(Z) if the intersection is larger than the difference (Z > n/2):

Confluently Persistent Sets and Maps

Abbey answered 26/1, 2013 at 16:18 Comment(0)
C
1

there is no optimal solution than O(Z) because if you think of the problem logically each of the intersect, union and disjoin algorithms must at least read all of the input elements once, so Z reads is a must. also since the set is not sorted by default, no further optimizations could beat O(Z)

Chinookan answered 23/11, 2010 at 22:37 Comment(5)
I would naturally agree in the general case, but I suspect there exist domain specific algorithms with a performance characteristic superior to O(Z). I'm going to add a clause about sorted\unsorted, because I think that's an interesting divide in the approaches to the problem. I can think of a lot of cases where you have data that is sorted-once-queried-often, or is maintained in something like a b-tree or p-trie. I'd go so far as to suggest that these are meaningful subdomains.Faison
@Jake Kurzer: the only way to have a hope of getting an algorithm better than O(n) is when you don't need to read over all the elements. I don't think it is possible to devise an "intersection", "union", "disjoin" algorithm that does not to read all elements of at least one set.Jitter
@Lie Ryan: The B-Y algorithm linked can (conditionally) do far better than O(n), but relies on the data being sort-once-operate-often or being in some self-sorting structure that makes the cost of maintaining sorted order minimal. I can think of a couple others, though they lack the reference implementation, and rely on specialized data structures.Faison
@Jake Kurzer: The B-Y algorithm reduces the number of comparisons, B-Y does not do better than O(n) insertion**/**deletion.Jitter
How do you figure? As I grasp it, B-Y only copies elements found to actually be in the intersection. I'd seriously adore a solid refutation!Faison
P
0

Abstractly, a set is something that supports an operation, "is X a member?". You can define that operation on the intersection A n B in terms of it on A and B. An implementation would look something like:

interface Set { bool isMember(Object X); };

class Intersection {
    Set a, b;
    public Intersection(Set A, Set B) { this.a = A; this.b = B; }

    public isMember(Object X) {
        return a.isMember(X) and b.isMember(Y);
    }
}

A and B could be implemented using an explicit set type, like a HashSet. The cost of that operation on each is quite cheap, let's approximate it with O(1); so the cost on the intersection is just 2 O(n). ;-)

Admittedly if you build a large hierarchy of intersections like this, checking for a member can be more expensive, up to O(n) for n sets in the hierarchy. A potential optimisation for this could be to check the depth of the hierarchy against a threshold, and materialise it into a HashSet if it exceeds it. This will reduce the member operation cost, and perhaps amortise the construction cost when many intersections are applied.

Pissed answered 24/11, 2010 at 0:1 Comment(2)
return a.isMember(X) and b.isMember(Y); should, I think, be: return a.isMember(X) and b.isMember(X);Faison
This only have the same semantic as intersect(a, b) when a,b are immutable.Jitter

© 2022 - 2024 — McMap. All rights reserved.