C++ library method for intersection of two unordered_set

Asked 8/1, 2018 at 22:13 Answered 25/8, 2023 at 13:29

I have two unordered_set and want the intersection of those. I can't find a library function to do that.

Essentially, what I want is this:

unordered_set<int> a = {1, 2, 3};
unordered_set<int> b = {2, 4, 1};

unordered_set<int> c = a.intersect(b); // Should be {1, 2}

I can do something like

unordered_set<int> c;
for (int element : a) {
  if (b.count(element) > 0) {
    c.insert(element);
  }
}

However, I think there should be a more convenient way to do that. If there isn't one, can someone explain why? I know there is std::set_intersection, but that seems to operate on vectors only.

Enteric answered 8/1, 2018 at 22:13 Comment(2)

I sure don't understand why you think set_intersection is limited to vectors only, it takes two ranges as input iterators and an output iterator. Just about any standard container should be able to satisfy those requirements. – Creatinine 8/1, 2018 at 22:21

Doing the simple loop approach is O(n) with an unordered_set. You sould use find instead of count though. @Creatinine set_intersection needs a sorted set. – Normand 8/1, 2018 at 22:24

In fact, a loop-based solutions is the best thing you can use with std::unordered_set.

There is an algorithm called std::set_intersection which allows to find an intersection of two sorted ranges:

Constructs a sorted range beginning at d_first consisting of elements that are found in both sorted ranges [first1, last1) and [first2, last2).

As you deal with std::unordered_set, you cannot apply this algorithm because there is no guaranteed order for the elements in std::unordered_set.

My advice is to stick with loops as it explicitly says what you want to achieve and has a linear complexity (O(N), where N is a number of elements in the unordered set you traverse with a for loop) which is the best compexity you might achieve.

Emelinaemeline answered 8/1, 2018 at 22:21 Comment(3)

So the complexity is O(min(len(s), len(l))) where s and l are the sets being intersected. Obviously you want to make N=min(len(s), len(l)) and not max as almost all libraries do such as in Python's set object. Just clarifying since sometimes these details are forgotten. – Transitory 30/11, 2019 at 6:14

@GregoryMorse: where did you get that info about Python set.__and__ being max of operands lengths? obviously it's not true – Koeninger 27/10, 2020 at 13:58

This answer lacks the important detail already pointed out by @GregoryMorse : "the unordered set you traverse with a for loop" should be the smaller of the two unordered sets. – Kirakiran 24/11, 2020 at 3:2

As the accepted answer explains, there is nothing that directly does what you want, and std::set_intersection isn't intended for intersection std::unordered_sets, even though it sounds like it. The ideal solution depends on the C++ standard you're using.

Given two sets std::unordered_set<T> a, b; ...

In-place

... the best method to turn a into the intersection a & b is as follows:

// C++20 (needs <unordered_set>)
std::erase_if(a, [&b](int e) {
    return !b.contains(e);
});

// C++11 (needs <unordered_set>)
for (auto it = a.begin(); it != a.end();) {
    if (!b.count(*it)) 
        it = a.erase(it);
    else               
        ++it;
}

Non-modifying

Alternatively, to compute a new set c which contains the intersection of a and b:

// C++23 (needs <unordered_set>, <ranges>)
std::unordered_set<int> c(std::from_range, a | std::views::filter([&b](int e) {
    return b.contains(e);
});

// C++20 (needs <unordered_set>, <ranges>)
auto view = a | std::views::filter([&b](int e) {
    return b.contains(e);
};
std::unordered_set<int> c(view.begin(), view.end());

// C++11 (needs <unordered_set>)
std::unordered_set<int> c;
// TODO: consider reserving a size (optimistically |a| + |b|)
for (int e : a) {
    if (b.count(e)) { c.insert(e); }
}

Davon answered 25/8, 2023 at 13:29 Comment(0)

-5

There is a function from std called set_intersection. However, it would have a very high complexity using it with std::set as input parameter.. A better solution is, create two vectors from those sets and use set_intersection with vectors as input parameters.

Grabowski answered 8/1, 2018 at 22:26 Comment(20)

doesn't set_intersection require a sorted input? – Eugenioeugenius 8/1, 2018 at 22:27

"However, it would have a very high complexity using it with std::set as input parameter." - Would you mind expanding on that? – Unijugate 8/1, 2018 at 22:27

If the values are already in std::unordered_set there is no more efficient approach then the simple loop. Your suggestion is just more complicated and worse performance. – Normand 8/1, 2018 at 22:31

@Normand Moving the value to vectors, sorting the vector, performing the intersection might be faster than the two above loops. You'd have to benchmark. – Unijugate 8/1, 2018 at 22:35

@Unijugate Not really. Using a loop + find gives O(n) time. Using 2 sorted vectors would also give O(n) time. But if you have to make the vectors AND sort them there is no way for that method to be faster. Please keep in mind that std::unordered_set has O(n) lookup unless the set is gigantic. – Normand 8/1, 2018 at 22:39

@Normand You are talking about theoretical asymptotic complexities. Vectors are known to perform faster than most containers for most operations, even faster at performing tasks dedicated to containers such as sorted insertion. You cannot assume that this approach will be slower unless you actually benchmark it. See e.g. lemire.me/blog/2017/01/27/…. – Unijugate 8/1, 2018 at 22:41

@Unijugate Sure, in principle you are right about the fact that you need to benchmark something to really know. But here we are talking about comparing walking and running. In principle you would need to test... but in practice. – Normand 8/1, 2018 at 23:35

@Normand Not really walking vs running. I just did some quick tests with clang, -O3, and sets with 300000 elements and both approaches have the same running time (within 2%). I used random data, who knows what could happens using data with specific distribution... – Unijugate 8/1, 2018 at 23:37

@Unijugate Surely you are not including creating + sorting the vectors in the running time? – Normand 8/1, 2018 at 23:39

@Normand You should have a look at a Bjarne Stroustrup's talk (I think it was him, not 100% sure, maybe Herb Sutter... ) at cppcon (can't remember which) where he spoke about vector performance vs other standard containers. – Unijugate 8/1, 2018 at 23:42

@Normand Of course I am including creating and sorting the vector. – Unijugate 8/1, 2018 at 23:42

Let us continue this discussion in chat. – Normand 8/1, 2018 at 23:46

@Normand I would be happy to continue this discussion but it's 1 a.m here so I'll pass for tonight. But I took the time to reboot my computer to upload this benchmark gist if you want to try it out yourself: gist.github.com/anonymous/bee65aa443ce6e35ae82fc3e5e89cca1 I get better performance with the non-vector version, but as you will notice, it is not as obvious as it would seem. I actually compiled with g++ (TDM on Windows), I am so used to use clang at work that I wrote clang in my previous comment... – Unijugate 8/1, 2018 at 23:51

I made one too The loop seems to end up somewhere around 55-65% of the intersect time on my machine with both our tests. – Normand 9/1, 2018 at 0:6

@Normand how many elements did you test with? Complexity differences are of course sensitive to dataset size. Complexity can only trump all other considerations when you know your data can be of infinite size. – Nerti 9/1, 2018 at 0:16

@MarkRansom 10000, 100000 and 1000000. – Normand 9/1, 2018 at 0:28

@Unijugate If I modify your benchmark to save the intersection results in a vector instead of a set, the loop version is almost 5x faster. – Normand 9/1, 2018 at 1:19

@Normand true, but if you convert it back to an unordered_set, the loop version is only twice as fast as the vector version. Even if the loop version is 5 times faster than the vector one, this is not what you'd expect with theoretical complexity (log2(1e6) ~ 20) so you'd expect the loop version to be 20 times faster (due to the sort complexity), not counting memory allocation. Anothing thing to mention is that with the vector version, the result is already sorted, so you could use it directly without having to sort it. [...] – Unijugate 9/1, 2018 at 9:26

@Normand [...] My point is that you cannot simply look at the theoretical complexities of the operations on the containers to choose the best one, you have to consider how the container is used in the algorithm. If you are not looking for performance but for simplicity, then set and unordered_set might be great, but when I seek performance, I often try to do as much as possible with vectors, because vectors are much faster due to their memory layout and much easier to optimize for the compiler. [...] – Unijugate 9/1, 2018 at 9:29

@Normand [...] Even if the loop version end up winning here, I hope that by now you get my original point! BTW, I get better result with vector using your version, 100000 elements and -O3 on clang (strange, isn't it?). Also, I'd like to point out that our benchmarks are quite poors since the two original sets tends to have an empty intersection, which is quite a specific here. – Unijugate 9/1, 2018 at 9:31

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

In-place

Non-modifying

Recommended topics

Hot tags