unordered set intersection in C++ [duplicate]
Here is my code; I'm wondering if there are any ideas to make it faster. My implementation is brute force: for every element in a, check whether it is also in b, and if so, put it in the result set c. Any smarter ideas are appreciated.

#include <cstdio>
#include <unordered_set>

int main() {
    std::unordered_set<int> a = {1,2,3,4,5};
    std::unordered_set<int> b = {3,4,5,6,7};
    std::unordered_set<int> c;
    for (auto i = a.begin(); i != a.end(); i++) {
        if (b.find(*i) != b.end()) c.insert(*i);
    }
    for (int v : c) {
        std::printf("%d \n", v);
    }
}
Ambur answered 20/12, 2017 at 8:2 Comment(7)
Questions asking for improvement of already working code should go to SE Code Review.Jumble
C++ has many nice algorithmic functions in the standard library, for example one called std::set_intersection.Analysis
@Someprogrammerdude ... which requires sorted input ranges, something std::unordered_set does not readily provide.Kenway
@Someprogrammerdude, yes, agree with Angew, I looked at this API and I think sorting requires additional time.Ambur
@Jumble Code that doesn't meet performance requirements is not working.Undercover
@user0042: Better algorithms are on-topic here.Ere
@n.m., nice catch! Totally agree.Ambur
Asymptotically, your algorithm is as good as it can get.

In practice, I'd add a check to loop over the smaller of the two sets and do lookups in the larger one. Assuming reasonably evenly distributed hashes, a lookup in a std::unordered_set takes constant time. So this way, you'll be performing fewer such lookups.
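A minimal sketch of that idea (the helper name intersect is mine, just for illustration):

#include <unordered_set>

std::unordered_set<int> intersect(const std::unordered_set<int>& a,
                                  const std::unordered_set<int>& b) {
    // Iterate over the smaller set and look each element up in the larger one.
    const auto& smaller = (a.size() <= b.size()) ? a : b;
    const auto& larger  = (a.size() <= b.size()) ? b : a;

    std::unordered_set<int> result;
    result.reserve(smaller.size());          // the intersection has at most smaller.size() elements
    for (int v : smaller) {
        if (larger.find(v) != larger.end())  // expected O(1) lookup
            result.insert(v);
    }
    return result;
}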

Kenway answered 20/12, 2017 at 8:20 Comment(1)
Thanks Angew, why is your method faster? Could you elaborate a bit more?Ambur
You can do it with std::copy_if()

std::copy_if(a.begin(), a.end(), std::inserter(c, c.begin()), [&b](const int element){ return b.count(element) > 0; });
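For reference, a self-contained version of that one-liner (reusing the a, b and c sets from the question; capturing b by reference avoids copying the whole set into the lambda):

#include <algorithm>      // std::copy_if
#include <cstdio>
#include <iterator>       // std::inserter
#include <unordered_set>

int main() {
    std::unordered_set<int> a = {1,2,3,4,5};
    std::unordered_set<int> b = {3,4,5,6,7};
    std::unordered_set<int> c;

    // Copy every element of a that also appears in b into c.
    std::copy_if(a.begin(), a.end(), std::inserter(c, c.begin()),
                 [&b](const int element){ return b.count(element) > 0; });

    for (int v : c) std::printf("%d\n", v);
}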
Electrograph answered 7/12, 2018 at 15:29 Comment(0)
Your algorithm is as good as it gets for an unordered set. However, if you use a std::set (which uses a binary tree as storage), or, even better, a sorted std::vector, you can do better. The algorithm should be something like:

  1. Get iterators to a.begin() and b.begin().
  2. If the iterators point to equal elements, add the element to the intersection and increment both iterators.
  3. Otherwise, increment the iterator pointing to the smaller value.
  4. Go to 2.

Both approaches are linear in the number of elements, but using an ordered container saves you from calculating hashes and from any performance degradation that arises from hash collisions.
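A minimal sketch of those steps on two sorted std::vectors (my own illustration; std::set_intersection from <algorithm> performs essentially the same merge):

#include <cstdio>
#include <vector>

// Merge-style intersection of two sorted vectors (steps 1-4 above).
std::vector<int> intersect_sorted(const std::vector<int>& a,
                                  const std::vector<int>& b) {
    std::vector<int> result;
    auto ia = a.begin(), ib = b.begin();
    while (ia != a.end() && ib != b.end()) {
        if (*ia == *ib) {            // equal elements: part of the intersection
            result.push_back(*ia);
            ++ia;
            ++ib;
        } else if (*ia < *ib) {      // advance the iterator at the smaller value
            ++ia;
        } else {
            ++ib;
        }
    }
    return result;
}

int main() {
    std::vector<int> a = {1,2,3,4,5};   // both inputs must be sorted
    std::vector<int> b = {3,4,5,6,7};
    for (int v : intersect_sorted(a, b)) std::printf("%d\n", v);
}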

Sakti answered 20/12, 2017 at 8:59 Comment(4)
Why do you think set is faster than unordered set in my use case?Ambur
You save on hash calculation and get true O(1) iteration without dealing with collisions. But benchmark if you're not sure.Sakti
@Sakti depends on how fast hash calculation actually is - but if the element counts differ greatly, hash sets gain the advantage again (provided one selects the smaller set as reference). We'd now have to know the use cases; tree sets might come with trade-offs elsewhere (random element access). Still good to point out an alternative...Serviceable
Note that step 3 can be tweaked a bit: instead of incrementing just once, look for the first element not smaller than the first of the other list. Phrased this way, it gets clearer that you can take advantage of random access (galloping strategy) or tree structure, to skip some elements. That does not affect the worst case, but it can matter in practice.Strickland
Thanks Angew, why is your method faster? Could you elaborate a bit more?

Well, let me provide you some additional info...

It should be pretty clear that, whichever data structures you use, you will have to iterate over all elements in at least one of them, so you cannot get better than O(n), n being the number of elements in the data structure you iterate over. What matters now is how fast you can look up the elements in the other structure – with a hash set, which std::unordered_set actually is, this is O(1) – at least if the number of collisions is small enough ("reasonably evenly distributed hashes"); the degenerate case would be all values having the same hash...

So far, you get O(n) * O(1) = O(n). But you still have the choice: O(n) or O(m), where m is the number of elements in the other set. In complexity terms this is the same – the algorithm is linear either way – but in practice you can spare some hash calculations and look-ups if you choose to iterate over the set with the smaller number of elements...

Serviceable answered 20/12, 2017 at 8:20 Comment(4)
Nice catch! Thanks.Ambur
Note that this is a worst case analysis. Depending on what the sets typically look like in this application, intersecting sorted sets may or may not be possible in sub-linear time (think of the extreme case where you intersect [0,100] with [200,300], one comparison is enough to notice that the result is empty).Strickland
@MarcGlisse Ah, back at doron's idea... (my answer - actually addition to a previous one - being about hash sets...). But we'd have to modify doron's algorithm to profit from such cases (as a whole or for partial ranges), maybe some clever partition algorithm?Serviceable
For the ordered case, some references can be found at gcc.gnu.org/bugzilla/show_bug.cgi?id=66418 .Strickland
