C++ library method for intersection of two unordered_set
Asked Answered
E

3

24

I have two unordered_set and want the intersection of those. I can't find a library function to do that.

Essentially, what I want is this:

unordered_set<int> a = {1, 2, 3};
unordered_set<int> b = {2, 4, 1};

unordered_set<int> c = a.intersect(b); // Should be {1, 2}

I can do something like

unordered_set<int> c;
for (int element : a) {
  if (b.count(element) > 0) {
    c.insert(element);
  }
}

However, I think there should be a more convenient way to do that. If there isn't one, can someone explain why? I know there is std::set_intersection, but that seems to operate on vectors only.

Enteric answered 8/1, 2018 at 22:13 Comment(2)
I sure don't understand why you think set_intersection is limited to vectors only, it takes two ranges as input iterators and an output iterator. Just about any standard container should be able to satisfy those requirements.Creatinine
Doing the simple loop approach is O(n) with an unordered_set. You sould use find instead of count though. @Creatinine set_intersection needs a sorted set.Normand
E
26

In fact, a loop-based solutions is the best thing you can use with std::unordered_set.

There is an algorithm called std::set_intersection which allows to find an intersection of two sorted ranges:

Constructs a sorted range beginning at d_first consisting of elements that are found in both sorted ranges [first1, last1) and [first2, last2).

As you deal with std::unordered_set, you cannot apply this algorithm because there is no guaranteed order for the elements in std::unordered_set.

My advice is to stick with loops as it explicitly says what you want to achieve and has a linear complexity (O(N), where N is a number of elements in the unordered set you traverse with a for loop) which is the best compexity you might achieve.

Emelinaemeline answered 8/1, 2018 at 22:21 Comment(3)
So the complexity is O(min(len(s), len(l))) where s and l are the sets being intersected. Obviously you want to make N=min(len(s), len(l)) and not max as almost all libraries do such as in Python's set object. Just clarifying since sometimes these details are forgotten.Transitory
@GregoryMorse: where did you get that info about Python set.__and__ being max of operands lengths? obviously it's not trueKoeninger
This answer lacks the important detail already pointed out by @GregoryMorse : "the unordered set you traverse with a for loop" should be the smaller of the two unordered sets.Kirakiran
D
0

As the accepted answer explains, there is nothing that directly does what you want, and std::set_intersection isn't intended for intersection std::unordered_sets, even though it sounds like it. The ideal solution depends on the C++ standard you're using.

Given two sets std::unordered_set<T> a, b; ...

In-place

... the best method to turn a into the intersection a & b is as follows:

// C++20 (needs <unordered_set>)
std::erase_if(a, [&b](int e) {
    return !b.contains(e);
});

// C++11 (needs <unordered_set>)
for (auto it = a.begin(); it != a.end();) {
    if (!b.count(*it)) 
        it = a.erase(it);
    else               
        ++it;
}

Non-modifying

Alternatively, to compute a new set c which contains the intersection of a and b:

// C++23 (needs <unordered_set>, <ranges>)
std::unordered_set<int> c(std::from_range, a | std::views::filter([&b](int e) {
    return b.contains(e);
});

// C++20 (needs <unordered_set>, <ranges>)
auto view = a | std::views::filter([&b](int e) {
    return b.contains(e);
};
std::unordered_set<int> c(view.begin(), view.end());

// C++11 (needs <unordered_set>)
std::unordered_set<int> c;
// TODO: consider reserving a size (optimistically |a| + |b|)
for (int e : a) {
    if (b.count(e)) { c.insert(e); }
}
Davon answered 25/8, 2023 at 13:29 Comment(0)
G
-5

There is a function from std called set_intersection. However, it would have a very high complexity using it with std::set as input parameter.. A better solution is, create two vectors from those sets and use set_intersection with vectors as input parameters.

Grabowski answered 8/1, 2018 at 22:26 Comment(20)
doesn't set_intersection require a sorted input?Eugenioeugenius
"However, it would have a very high complexity using it with std::set as input parameter." - Would you mind expanding on that?Unijugate
If the values are already in std::unordered_set there is no more efficient approach then the simple loop. Your suggestion is just more complicated and worse performance.Normand
@Normand Moving the value to vectors, sorting the vector, performing the intersection might be faster than the two above loops. You'd have to benchmark.Unijugate
@Unijugate Not really. Using a loop + find gives O(n) time. Using 2 sorted vectors would also give O(n) time. But if you have to make the vectors AND sort them there is no way for that method to be faster. Please keep in mind that std::unordered_set has O(n) lookup unless the set is gigantic.Normand
@Normand You are talking about theoretical asymptotic complexities. Vectors are known to perform faster than most containers for most operations, even faster at performing tasks dedicated to containers such as sorted insertion. You cannot assume that this approach will be slower unless you actually benchmark it. See e.g. lemire.me/blog/2017/01/27/….Unijugate
@Unijugate Sure, in principle you are right about the fact that you need to benchmark something to really know. But here we are talking about comparing walking and running. In principle you would need to test... but in practice.Normand
@Normand Not really walking vs running. I just did some quick tests with clang, -O3, and sets with 300000 elements and both approaches have the same running time (within 2%). I used random data, who knows what could happens using data with specific distribution...Unijugate
@Unijugate Surely you are not including creating + sorting the vectors in the running time?Normand
@Normand You should have a look at a Bjarne Stroustrup's talk (I think it was him, not 100% sure, maybe Herb Sutter... ) at cppcon (can't remember which) where he spoke about vector performance vs other standard containers.Unijugate
@Normand Of course I am including creating and sorting the vector.Unijugate
Let us continue this discussion in chat.Normand
@Normand I would be happy to continue this discussion but it's 1 a.m here so I'll pass for tonight. But I took the time to reboot my computer to upload this benchmark gist if you want to try it out yourself: gist.github.com/anonymous/bee65aa443ce6e35ae82fc3e5e89cca1 I get better performance with the non-vector version, but as you will notice, it is not as obvious as it would seem. I actually compiled with g++ (TDM on Windows), I am so used to use clang at work that I wrote clang in my previous comment...Unijugate
I made one too The loop seems to end up somewhere around 55-65% of the intersect time on my machine with both our tests.Normand
@Normand how many elements did you test with? Complexity differences are of course sensitive to dataset size. Complexity can only trump all other considerations when you know your data can be of infinite size.Nerti
@MarkRansom 10000, 100000 and 1000000.Normand
@Unijugate If I modify your benchmark to save the intersection results in a vector instead of a set, the loop version is almost 5x faster.Normand
@Normand true, but if you convert it back to an unordered_set, the loop version is only twice as fast as the vector version. Even if the loop version is 5 times faster than the vector one, this is not what you'd expect with theoretical complexity (log2(1e6) ~ 20) so you'd expect the loop version to be 20 times faster (due to the sort complexity), not counting memory allocation. Anothing thing to mention is that with the vector version, the result is already sorted, so you could use it directly without having to sort it. [...]Unijugate
@Normand [...] My point is that you cannot simply look at the theoretical complexities of the operations on the containers to choose the best one, you have to consider how the container is used in the algorithm. If you are not looking for performance but for simplicity, then set and unordered_set might be great, but when I seek performance, I often try to do as much as possible with vectors, because vectors are much faster due to their memory layout and much easier to optimize for the compiler. [...]Unijugate
@Normand [...] Even if the loop version end up winning here, I hope that by now you get my original point! BTW, I get better result with vector using your version, 100000 elements and -O3 on clang (strange, isn't it?). Also, I'd like to point out that our benchmarks are quite poors since the two original sets tends to have an empty intersection, which is quite a specific here.Unijugate

© 2022 - 2024 — McMap. All rights reserved.