tr1::unordered_set union and intersection
Asked Answered
A

3

24

How to do intersection and union for sets of the type tr1::unordered_set in c++? I can't find much reference about it.

Any reference and code will be highly appreciated. Thank you very much.

Update: I just guessed the tr1::unordered_set should provide the function for intersection, union, difference.. Since that's the basic operation of sets. Of course I can write a function by myself, but I just wonder if there are built in function from tr1. Thank you very much.

Actinology answered 22/5, 2009 at 2:31 Comment(0)
K
17

I see that set_intersection() et al. from the algorithm header won't work as they explicitly require their inputs to be sorted -- guess you ruled them out already.

It occurs to me that the "naive" approach of iterating through hash A and looking up every element in hash B should actually give you near-optimal performance, since successive lookups in hash B will be going to the same hash bucket (assuming that both hashes are using the same hash function). That should give you decent memory locality, even though these buckets are almost certainly implemented as linked lists.

Here's some code for unordered_set_difference(), you can tweak it to make the versions for set union and set difference:

template <typename InIt1, typename InIt2, typename OutIt>
OutIt unordered_set_intersection(InIt1 b1, InIt1 e1, InIt2 b2, InIt2 e2, OutIt out) {
    while (!(b1 == e1)) {
        if (!(std::find(b2, e2, *b1) == e2)) {
            *out = *b1;
            ++out;
        }

        ++b1;
    }

    return out;
}

Assuming you have two unordered_sets, x and y, you can put their intersection in z using:

unordered_set_intersection(
    x.begin(), x.end(),
    y.begin(), y.end(),
    inserter(z, z.begin())
);

Unlike bdonlan's answer, this will actually work for any key types, and any combination of container types (although using set_intersection() will of course be faster if the source containers are sorted).

NOTE: If bucket occupancies are high, it's probably faster to copy each hash into a vector, sort them and set_intersection() them there, since searching within a bucket containing n elements is O(n).

Kristlekristo answered 22/5, 2009 at 2:31 Comment(8)
+1 Excellent answer. It would be interesting to benchmark this code. It might actually be faster (if the sets are bigger but not too big) to copy them into a sorted set and run std::set_intersection().Matlock
Thanks ceretullis. Yes, I suspect that would be faster if the buckets have high occupancy, though in that case I suspect copying them to vectors and sorting those will be faster still, just because there is less memory overhead and no pointer chasing involved. (Sorting a vector and creating a sorted set are both O(nlog n).)Kristlekristo
I'm a little concerned. Are we sure that std::find will work well with iterators into set? Won't the find simply iterate through every element in the second set, whereas we want it to use the hash to loopup? Shouldn't the function simply take a reference to the set object and then use the .count method?Vair
Good point @AaronMcDaid. I assumed that containers that allow fast lookups, like set and unordered_set, would provide template overloads for std::find(), but certainly that's not mentioned in the (old) standard or any documentation I could find online, which require only basic InputIterators and so will take linear time. An intelligent std::find() would only be possible if knowing an iterator pair is enough to enable the fast lookup, and that's probably not the case...Kristlekristo
... Following your suggestion, it would be better to write a free template function intelligently_find<T>() that takes a reference to the container (instead of an iterator pair), give it overloads for containers that allow fast lookups, and let it otherwise fall back on std::find().Kristlekristo
-1 to this answer: I've done some experiments to confirm that std :: find is slow, and hence I'm upvoting @bdonlan's answer instead. ideone.com/Lr64p (Thanks @j_random_hacker)Vair
Here's a set_union for unordered_set: std::copy( std::begin(uset2), std::end(uset2), std::inserter( uset1, std::end(uset1) ) );Drive
"If bucket occupancies are high" then you have a performance issue not just for union/intersection; maybe your hash function is bad?Katelin
P
14

There's nothing much to it - for intersect, just go through every element of one and ensure it's in the other. For union, add all items from both input sets.

For example:

void us_isect(std::tr1::unordered_set<int> &out,
        const std::tr1::unordered_set<int> &in1,
        const std::tr1::unordered_set<int> &in2)
{
    out.clear();
    if (in2.size() < in1.size()) {
        us_isect(out, in2, in1);
        return;
    }
    for (std::tr1::unordered_set<int>::const_iterator it = in1.begin(); it != in1.end(); it++)
    {
        if (in2.find(*it) != in2.end())
            out.insert(*it);
    }
}

void us_union(std::tr1::unordered_set<int> &out,
        const std::tr1::unordered_set<int> &in1,
        const std::tr1::unordered_set<int> &in2)
{
    out.clear();
    out.insert(in1.begin(), in1.end());
    out.insert(in2.begin(), in2.end());
}
Parvati answered 22/5, 2009 at 2:31 Comment(2)
You can speed up the case of intersecting a big set with a small one by iterating the small one and testing membership in the big one.Charlacharlady
In us_union, doing out = in1; should be more efficient than clearing and inserting from an iterator range, because there is no need to test for duplicates on insertion. In us_isect the out.clear() could go after the condition that checks for the smaller container, because there's no need to clear it twice. I would simply use in2.count(*it) instead of using in2.find(*it) != in2.end()Filter
H
3

based on the previous answer: C++11 version, if the set supports a fast look up function find() (return values are moved efficiently)

/** Intersection and union function for unordered containers which support a fast lookup function find()
 *  Return values are moved by move-semantics, for c++11/c++14 this is efficient, otherwise it results in a copy
 */

namespace unorderedHelpers {

    template<typename UnorderedIn1, typename UnorderedIn2,
             typename UnorderedOut = UnorderedIn1>
    UnorderedOut makeIntersection(const  UnorderedIn1 &in1, const  UnorderedIn2 &in2)
    {
        if (in2.size() < in1.size()) {
            return makeIntersection<UnorderedIn2,UnorderedIn1,UnorderedOut>(in2, in1);
        }

        UnorderedOut out;
        auto e = in2.end();
        for(auto & v : in1)
        {
            if (in2.find(v) != e){
                out.insert(v);
            }
        }
        return out;
    }

    template<typename UnorderedIn1, typename UnorderedIn2,
             typename UnorderedOut = UnorderedIn1>
    UnorderedOut makeUnion(const UnorderedIn1 &in1, const UnorderedIn2 &in2)
    {
        UnorderedOut out;
        out.insert(in1.begin(), in1.end());
        out.insert(in2.begin(), in2.end());
        return out;
    }
}
Hue answered 22/5, 2009 at 2:31 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.