unordered set intersection in C++ [duplicate]
Here is my code; I'm wondering if there are any ideas to make it faster. My implementation is brute force: for every element in a, check whether it is also in b, and if so, put it in the result set c. Any smarter ideas are appreciated.

#include <cstdio>
#include <unordered_set>

int main() {
    std::unordered_set<int> a = {1,2,3,4,5};
    std::unordered_set<int> b = {3,4,5,6,7};
    std::unordered_set<int> c;
    for (auto i = a.begin(); i != a.end(); i++) {
        if (b.find(*i) != b.end()) c.insert(*i);
    }
    for (int v : c) {
        std::printf("%d \n", v);
    }
}
Ambur answered 20/12, 2017 at 8:2 Comment(7)
Questions asking for improvement of already working code should go to SE Code Review.Jumble
C++ has many nice algorithmic functions in the standard library, for example one called std::set_intersection.Analysis
@Someprogrammerdude ... which requires sorted input ranges, something std::unordered_set does not readily provide.Kenway
@Someprogrammerdude, yes, agree with Angew, I looked at this API and I think sorting requires additional time.Ambur
@Jumble Code that doesn't meet performance requirements is not working.Undercover
@user0042: Better algorithms are on-topic here.Ere
@n.m., nice catch! Totally agree.Ambur
Asymptotically, your algorithm is as good as it can get.

In practice, I'd add a check to loop over the smaller of the two sets and do lookups in the larger one. Assuming reasonably evenly distributed hashes, a lookup in a std::unordered_set takes constant time. So this way, you'll be performing fewer such lookups.
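A minimal sketch of that idea (the helper name intersect is mine, just for illustration):

#include <unordered_set>

std::unordered_set<int> intersect(const std::unordered_set<int>& a,
                                  const std::unordered_set<int>& b) {
    // Iterate over the smaller set and look each element up in the larger one.
    const auto& smaller = (a.size() <= b.size()) ? a : b;
    const auto& larger  = (a.size() <= b.size()) ? b : a;

    std::unordered_set<int> result;
    result.reserve(smaller.size());          // the intersection has at most smaller.size() elements
    for (int v : smaller) {
        if (larger.find(v) != larger.end())  // expected O(1) lookup
            result.insert(v);
    }
    return result;
}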

Kenway answered 20/12, 2017 at 8:20 Comment(1)
Thanks Angew, why is your method faster? Could you elaborate a bit more?Ambur
You can do it with std::copy_if()

std::copy_if(a.begin(), a.end(), std::inserter(c, c.begin()), [&b](const int element){ return b.count(element) > 0; });
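For reference, a self-contained version of that one-liner (reusing the a, b and c sets from the question; capturing b by reference avoids copying the whole set into the lambda):

#include <algorithm>      // std::copy_if
#include <cstdio>
#include <iterator>       // std::inserter
#include <unordered_set>

int main() {
    std::unordered_set<int> a = {1,2,3,4,5};
    std::unordered_set<int> b = {3,4,5,6,7};
    std::unordered_set<int> c;

    // Copy every element of a that also appears in b into c.
    std::copy_if(a.begin(), a.end(), std::inserter(c, c.begin()),
                 [&b](const int element){ return b.count(element) > 0; });

    for (int v : c) std::printf("%d\n", v);
}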
Electrograph answered 7/12, 2018 at 15:29 Comment(0)
Your algorithm is as good as it gets for an unordered set. However, if you use a std::set (which uses a binary tree as storage), or, even better, a sorted std::vector, you can do better. The algorithm should be something like:

  1. Get iterators to a.begin() and b.begin().
  2. If the iterators point to equal elements, add the element to the intersection and increment both iterators.
  3. Otherwise, increment the iterator pointing to the smaller value.
  4. Go to 2.

Both approaches are linear in the number of elements, but using an ordered container saves you from calculating hashes and from any performance degradation that arises from hash collisions.
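A minimal sketch of those steps on two sorted std::vectors (my own illustration; std::set_intersection from <algorithm> performs essentially the same merge):

#include <cstdio>
#include <vector>

// Merge-style intersection of two sorted vectors (steps 1-4 above).
std::vector<int> intersect_sorted(const std::vector<int>& a,
                                  const std::vector<int>& b) {
    std::vector<int> result;
    auto ia = a.begin(), ib = b.begin();
    while (ia != a.end() && ib != b.end()) {
        if (*ia == *ib) {            // equal elements: part of the intersection
            result.push_back(*ia);
            ++ia;
            ++ib;
        } else if (*ia < *ib) {      // advance the iterator at the smaller value
            ++ia;
        } else {
            ++ib;
        }
    }
    return result;
}

int main() {
    std::vector<int> a = {1,2,3,4,5};   // both inputs must be sorted
    std::vector<int> b = {3,4,5,6,7};
    for (int v : intersect_sorted(a, b)) std::printf("%d\n", v);
}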

Sakti answered 20/12, 2017 at 8:59 Comment(4)
Why do you think set is faster than unordered set in my use case?Ambur
You save on hash calculation and get true O(1) iteration without dealing with collisions. But benchmark if you're not sure.Sakti
@Sakti depends on how fast hash calculation actually is - but if the element counts differ greatly, hash sets gain the advantage again (provided one selects the smaller set as reference). We'd now have to know the use cases; tree sets might come with trade-offs elsewhere (random element access). Still good to point out an alternative...Serviceable
Note that step 3 can be tweaked a bit: instead of incrementing just once, look for the first element not smaller than the first of the other list. Phrased this way, it gets clearer that you can take advantage of random access (galloping strategy) or tree structure, to skip some elements. That does not affect the worst case, but it can matter in practice.Strickland
Thanks Angew, why is your method faster? Could you elaborate a bit more?

Well, let me provide you some additional info...

It should be pretty clear that, whichever data structures you use, you will have to iterate over all elements in at least one of them, so you cannot get better than O(n), n being the number of elements in the data structure you iterate over. What matters now is how fast you can look up the elements in the other structure – with a hash set, which std::unordered_set actually is, this is O(1) – at least if the number of collisions is small enough ("reasonably evenly distributed hashes"); the degenerate case would be all values having the same hash...

So far, you get O(n) * O(1) = O(n). But you still have the choice: O(n) or O(m), where m is the number of elements in the other set. In complexity terms this is the same – the algorithm is linear either way – but in practice you can spare some hash calculations and look-ups if you choose to iterate over the set with the smaller number of elements...

Serviceable answered 20/12, 2017 at 8:20 Comment(4)
Nice catch! Thanks.Ambur
Note that this is a worst case analysis. Depending on what the sets typically look like in this application, intersecting sorted sets may or may not be possible in sub-linear time (think of the extreme case where you intersect [0,100] with [200,300], one comparison is enough to notice that the result is empty).Strickland
@MarcGlisse Ah, back at doron's idea... (my answer - actually addition to a previous one - being about hash sets...). But we'd have to modify doron's algorithm to profit from such cases (as a whole or for partial ranges), maybe some clever partition algorithm?Serviceable
For the ordered case, some references can be found at gcc.gnu.org/bugzilla/show_bug.cgi?id=66418 .Strickland
