Why would anyone use set instead of unordered_set?

Asked 28/8, 2009 at 22:42 Answered 5/12, 2023 at 15:38

Solved c++ c++11 stl unordered-set stdset

209

C++0x is introducing unordered_set, which is available in boost and many other places. What I understand is that unordered_set is a hash table with O(1) lookup complexity. On the other hand, set is nothing but a tree with O(log n) lookup complexity.

Why on earth would anyone use set instead of unordered_set? I.e., is there a need for set anymore?

Copepod answered 28/8, 2009 at 22:42 Comment(4)

Your question is fundamentally asking if there is a need for a tree anymore. – Bridgework 28/8, 2009 at 23:23

I think I stated it clearly in the first line, that this is a somewhat stupid question. I was missing something and now I got the answer :) – Copepod 28/8, 2009 at 23:30

The real reason is that things aren't as B&W as they seem. There are a lot of greys and other colors in between. You need to remember these containers are tools. Sometimes performance isn't crucial and convenience is far more meaningful. If people all looked for the most efficient solution we'd never use C++ (not to mention Python) in the first place and continuously write and optimize code in machine language. – Erna 26/11, 2015 at 14:39

(Why on earth would anyone use a generic name for an implementation/interface with promises beyond those implied by that name, creating an awkward situation for ones without?) – Premillenarian 8/9, 2016 at 21:46

261

When the order matters to someone who wants to iterate over the items of the set.

Dichlorodifluoromethane answered 28/8, 2009 at 22:45 Comment(3)

Is it ordered according to the insertion order, or according to real comparison using operators < > ? – Windburn 25/11, 2015 at 14:8

It's ordered using std::less by default; you can override this and supply your own comparison operator. cplusplus.com/reference/set/set – Dichlorodifluoromethane 26/11, 2015 at 6:48

Or sometimes when you only want to iterate, even if the order doesn't matter. – Delightful 1/1, 2020 at 10:21
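The default ordering and the custom comparison mentioned in the comments above can be illustrated with a minimal sketch. The function names ascending_order and descending_order are made up for this example; only std::set, std::less and std::greater come from the standard library.

```cpp
#include <functional>
#include <set>
#include <vector>

// Returns the elements in the iteration order of a default std::set,
// which compares with std::less<int> (ascending, duplicates dropped).
std::vector<int> ascending_order(const std::vector<int>& input) {
    std::set<int> s(input.begin(), input.end());
    return std::vector<int>(s.begin(), s.end());
}

// Same idea, but with std::greater<int> supplied as the comparison,
// so iteration runs from largest to smallest.
std::vector<int> descending_order(const std::vector<int>& input) {
    std::set<int, std::greater<int>> s(input.begin(), input.end());
    return std::vector<int>(s.begin(), s.end());
}
```

Calling ascending_order({1, 8, 2, 5, 3, 9}) yields 1 2 3 5 8 9, while descending_order on the same input yields 9 8 5 3 2 1. Insertion order plays no role in either case.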

441

Unordered sets have to pay for their O(1) average access time in a few ways:

set uses less memory than unordered_set to store the same number of elements.
For a small number of elements, lookups in a set might be faster than lookups in an unordered_set.
Even though many operations are faster in the average case for unordered_set, they are often guaranteed to have better worst case complexities for set (for example insert).
That set sorts the elements is useful if you want to access them in order.
You can lexicographically compare different sets with <, <=, > and >=. unordered_sets are not required to support these operations.
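The last point can be sketched as follows. The helper names set_less and same_elements are hypothetical; the operators they wrap are the ones the standard containers provide.

```cpp
#include <set>
#include <unordered_set>

// std::set provides operator<, which compares the two sorted sequences
// lexicographically (element by element, in iteration order).
bool set_less(const std::set<int>& a, const std::set<int>& b) {
    return a < b;
}

// std::unordered_set provides only == and !=; writing a < b here
// would not compile.
bool same_elements(const std::unordered_set<int>& a,
                   const std::unordered_set<int>& b) {
    return a == b;
}
```

For example, {1, 2, 3} compares less than {1, 2, 4}, and {1, 3} does not compare less than {1, 2, 9} because the second elements already decide the result.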

Planar answered 28/8, 2009 at 23:33 Comment(6)

+1, all excellent points. People tend to overlook the fact that hashtables have O(1) average-case access time, meaning they can occasionally have big delays. The distinction can be important for real-time systems. – Farthing 3/9, 2009 at 8:44

Good points, however here (en.cppreference.com/w/cpp/container/unordered_set/operator_cmp) it is stated that we can compare unordered_sets. – Variometer 3/1, 2015 at 17:20

Define a "small number of elements" – Rea 3/4, 2015 at 13:25

@SunjayVarma usually 100 elements is a good cut-off between the two. When in doubt nothing can replace testing performance of the two in your specific use-case. – Piatt 12/9, 2016 at 18:9

@MichieluithetBroek Only equality comparison is stated, not ordering (<). – Dessiedessma 20/11, 2017 at 13:21

In defining "small number of elements" I suggest one actually measures what that means for the problem they are attempting to solve. I.e., don't assume. Benchmark if you have the time. – Onlybegotten 24/7, 2018 at 17:2

Whenever you prefer a tree to a hash table.

For instance, hash tables are "O(n)" at worst case. O(1) is the average case. Trees are "O(log n)" at worst.

Pinzler answered 28/8, 2009 at 22:44 Comment(10)

/Balanced/ trees are O(ln n) in the worst case. You can end up with O(n) trees (essentially linked lists). – Jar 28/8, 2009 at 22:50

If you can write a reasonably intelligent hash function, you can almost always get O(1) perf out of a hashtable. If you can't write such a hash function or if you need to iterate "in order" over your set, then you should use a tree. But you shouldn't use a tree because you're scared of "O(n) worst-case performance." – Lemniscate 28/8, 2009 at 22:54

stager: To be pedantic, yes. However, we're talking about set in C++, which is typically implemented as a balanced binary search tree. We should have specified the actual operation to talk about complexity. In this context it's obvious that we're talking about lookup. – Pinzler 28/8, 2009 at 22:54

Justin L: It's just one reason you might prefer a tree. The core of my answer is the first line. Whenever you prefer a tree data structure to a hash table. There are plenty of cases where trees are preferred to hash tables. Hash tables particularly suck at things like "range intersections." – Pinzler 28/8, 2009 at 22:56

stl trees are almost universally implemented as red-black trees, an advanced self-balancing tree. There really are cases where O(n) lookup in the worst case is not acceptable. A web service that provides an interface to store user values should not use a hash map, as a malicious user could effectively create a DoS by storing specially crafted values. Critical, time-sensitive systems may also not allow for O(n) lookup, air traffic control etc. Though in general you're right, use the hash maps by default and only switch to the tree version when you've got a real need. – Oulu 2/9, 2009 at 19:44

@strager: The C++ standard mandates logarithmic (not amortized logarithmic) insertion and lookup times for all associative containers (set, map etc.) So balanced trees will be used, at least until someone comes up with a better data structure (don't hold your breath!) – Farthing 3/9, 2009 at 8:48

The linear worst case can be mitigated by using a red-black tree to disambiguate the collisions in the hash table. The standard version doesn't do it because it is not trivial to implement, but it is theoretically possible to have a logarithmic worst case even for hash tables. – Puce 23/9, 2013 at 19:20

If you get O(n) out of a hash table, you absolutely need to fix your hash function... – Fathom 20/11, 2017 at 13:27

The worst case O(n) for hash tables is easy to avoid. For that to happen, everything in your hash table has to map to one hash value, in which case you need a better hash function. – Farwell 20/5, 2019 at 2:37

@Farwell No matter the hash function, if those values can be user-provided they can be crafted in a way to get O(n). – Elysia 12/5, 2021 at 9:14

Use set when:

We need ordered data (distinct elements).
We would have to print/access the data (in sorted order).
We need predecessor/successor of elements.

Use unordered_set when:

We need to keep a set of distinct elements and no ordering is required.
We need single element access i.e. no traversal.
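The predecessor/successor point from the list above could look like the following sketch. The function names are hypothetical and std::optional assumes C++17 or later; lower_bound and upper_bound are the std::set members that make this an O(log n) operation.

```cpp
#include <iterator>
#include <optional>
#include <set>

// Largest element strictly less than x, or nothing if there is none.
std::optional<int> predecessor(const std::set<int>& s, int x) {
    auto it = s.lower_bound(x);               // first element >= x
    if (it == s.begin()) return std::nullopt; // nothing smaller than x
    return *std::prev(it);
}

// Smallest element strictly greater than x, or nothing if there is none.
std::optional<int> successor(const std::set<int>& s, int x) {
    auto it = s.upper_bound(x);               // first element > x
    if (it == s.end()) return std::nullopt;   // nothing larger than x
    return *it;
}
```

An unordered_set has no equivalent member functions; answering the same question there means scanning every element.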

Examples:

set:

Input : 1, 8, 2, 5, 3, 9

Output : 1, 2, 3, 5, 8, 9

Unordered_set:

Input : 1, 8, 2, 5, 3, 9

Output : 9 3 1 8 2 5 (maybe this order, influenced by hash function)

Main difference:

Note: in some cases set is more convenient, for example when using a vector as the key:

set<vector<int>> s;
s.insert({1, 2});
s.insert({1, 3});
s.insert({1, 2});   // duplicate, ignored

for(const auto& vec:s)
    cout<<vec<<endl;   // assumes operator<< has been overloaded for vector<int>
// 1 2
// 1 3

vector<int> can be used as a key in set because vector provides operator<.

But if you use unordered_set<vector<int>> you have to supply a hash function for vector<int>, because std::hash is not specialized for vector. You have to define one like this:

struct VectorHash {
    size_t operator()(const std::vector<int>& v) const {
        std::hash<int> hasher;
        size_t seed = 0;
        for (int i : v) {
            seed ^= hasher(i) + 0x9e3779b9 + (seed<<6) + (seed>>2);
        }
        return seed;
    }
};

void two(){
    //unordered_set<vector<int>> s; // error: std::hash is not specialized for vector<int>
    unordered_set<vector<int>, VectorHash> s;
    s.insert({1, 2});
    s.insert({1, 3});
    s.insert({1, 2});   // duplicate, ignored

    for(const auto& vec:s)
        cout<<vec<<endl;   // relies on the same operator<< overload for vector<int>
    // prints "1 2" and "1 3", in unspecified order
}

As you can see, in some cases unordered_set is more complicated to use.

Mainly cited from: https://www.geeksforgeeks.org/set-vs-unordered_set-c-stl/ https://mcmap.net/q/128891/-c-unordered_set-of-vectors

Buffet answered 6/9, 2018 at 12:17 Comment(0)

g++ 6.4 libstdc++ ordered vs unordered set benchmark

I benchmarked this dominant Linux C++ implementation to see the difference:

The full benchmark details and analysis have been given at: What is the underlying data structure of a STL set in C++? and I will not repeat them here.

"BST" means "tested with std::set" and "hash map" means "tested with std::unordered_set". "Heap" is for std::priority_queue which I analyzed at: Heap vs Binary Search Tree (BST)

As a quick summary:

The graph clearly shows that under these conditions, hash map insertion was always a lot faster when there are more than 100k items, and the difference grows as the number of items increases.

The cost of this speed boost is that you are not able to efficiently traverse in order.
The curves clearly suggest that ordered std::set is BST-based and std::unordered_set is hash map based. In the reference answer, I further confirmed that by GDB step debugging the code.

Similar question for map vs unordered_map: Is there any advantage of using map over unordered_map in case of trivial keys?

Hara answered 4/4, 2019 at 8:21 Comment(0)

While this answer might be 10 years late, it's worth pointing out that std::unordered_set also has security downsides.

If the hash function is predictable (this is typically the case unless it applies counter-measures such as a randomized salt), attackers can hand-craft data that produces hash collisions and causes all insertions and look-ups to take O(n) time.

This can be used for very efficient and elegant denial-of-service attacks.
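To make the degenerate case concrete, here is a minimal sketch that simulates a fully successful collision attack with a deliberately bad hash. ConstantHash and colliding_bucket_size are hypothetical names for this illustration; a real attack would instead craft keys that collide under the container's actual hash function.

```cpp
#include <cstddef>
#include <unordered_set>

// Deliberately bad hash, for demonstration only: every key gets the
// same hash value, which is the situation an attacker tries to provoke.
struct ConstantHash {
    std::size_t operator()(int) const { return 0; }
};

// Inserts the keys 0..n-1 (n must be at least 1) and returns the size
// of the bucket that holds key 0.
std::size_t colliding_bucket_size(int n) {
    std::unordered_set<int, ConstantHash> s;
    for (int i = 0; i < n; ++i) s.insert(i);
    return s.bucket_size(s.bucket(0));
}
```

With all n keys in a single bucket, each lookup or insertion has to walk that bucket, so building the set costs O(n^2) comparisons in total. A std::set fed the same keys stays at O(log n) per operation.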

Many (most?) implementations of languages that internally employ hash maps have run into this:

Burgher answered 21/11, 2019 at 14:44 Comment(0)

Because std::set is part of Standard C++ and unordered_set isn't. C++0x is NOT a standard, and neither is Boost. For many of us, portability is essential, and that means sticking to the standard.

Rapping answered 28/8, 2009 at 22:47 Comment(3)

If I understand him correctly, he is not asking why people currently still use set. He is informing himself about C++0x. – Hapless 28/8, 2009 at 22:58

Maybe. I thought everyone knew hash tables and trees solved different problems. – Rapping 28/8, 2009 at 23:4

Well, it's a standard now (only took a few years) – Teage 28/11, 2011 at 15:45

Consider sweepline algorithms. These algorithms would fail utterly with hash tables, but work beautifully with balanced trees. For a concrete example of a sweepline algorithm, consider Fortune's algorithm. http://en.wikipedia.org/wiki/Fortune%27s_algorithm

Valentino answered 2/9, 2009 at 8:0 Comment(1)

I think such reference is too complex given the question. (I had to look it up) – Tarttan 7/11, 2015 at 0:17

Pardon me, one more thing worth noticing about the sorted property:

If you want a range of data in the container, for example: you stored times in a set, and you want the times from 2013-01-01 to 2014-01-01.

For unordered_set this cannot be done efficiently: there is no range lookup, so you would have to scan every element.

Of course, this example would be more convincing for use cases comparing map and unordered_map.
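A range query like the one described above might be sketched as follows. The function name in_range is hypothetical, and dates are encoded as YYYYMMDD integers purely to keep the example short.

```cpp
#include <set>
#include <vector>

// Returns all elements in the closed range [lo, hi], in ascending order.
// Finding the two ends costs O(log n); the rest is proportional to the
// number of elements returned.
std::vector<int> in_range(const std::set<int>& s, int lo, int hi) {
    if (lo > hi) return {};           // empty range, avoid invalid iterators
    auto first = s.lower_bound(lo);   // first element >= lo
    auto last = s.upper_bound(hi);    // first element > hi
    return std::vector<int>(first, last);
}
```

With a set holding 20121231, 20130101, 20130615, 20140101 and 20140102, the call in_range(s, 20130101, 20140101) returns the three middle values.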

Perspire answered 3/2, 2015 at 19:23 Comment(0)

One more thing, in addition to what other people already mentioned. While the expected amortized complexity for inserting an element to an unordered_set is O(1), every now and then it will take O(n) because the hash-table needs to be restructured (the number of buckets needs to change) - even with a 'good' hash function. Just like inserting an element in a vector takes O(n) every now and then because the underlying array needs to be reallocated.

Inserting in a set always takes at most O(log n). This might be preferable in some applications.
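The occasional restructuring can be observed by watching bucket_count() while inserting. This is a rough sketch; count_rehashes is a hypothetical helper, and the exact number of rehashes depends on the standard library implementation.

```cpp
#include <cstddef>
#include <unordered_set>

// Counts how often the bucket count changes while inserting 0..n-1.
// If reserve_first is true, room for n elements is requested up front.
std::size_t count_rehashes(int n, bool reserve_first) {
    std::unordered_set<int> s;
    if (reserve_first) s.reserve(static_cast<std::size_t>(n));
    std::size_t rehashes = 0;
    std::size_t buckets = s.bucket_count();
    for (int i = 0; i < n; ++i) {
        s.insert(i);
        if (s.bucket_count() != buckets) {   // the table was rebuilt
            ++rehashes;
            buckets = s.bucket_count();
        }
    }
    return rehashes;
}
```

Without the reserve call, inserting 10000 elements triggers at least one rehash, each of which touches every element stored so far. When the final size is known in advance, reserve moves that cost to a single point before the insertions start.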

Rohr answered 14/3, 2011 at 15:29 Comment(0)

Off hand, I would say it is convenient to have things in a relationship if you're looking to convert it into a different format.

It is also possible that whilst one is faster to access, the time to build the index or the memory used when creating and/or accessing it is greater.

Kimikokimitri answered 28/8, 2009 at 22:44 Comment(1)

+1, Big Oh notation hides the constant factors, and for typical problem sizes it's often the constant factors that matter most. – Farthing 3/9, 2009 at 8:50

If you want to have things sorted, then you would use set instead of unordered_set. unordered_set is used over set when the order of the stored elements does not matter.

Tali answered 28/8, 2009 at 22:46 Comment(0)

Here's a practical reason that I haven't seen listed... if used incorrectly in buggy code, unordered sets can cause code to behave differently on different machines. This is because the order that the values are stored is not consistent across machines.

If code is (incorrectly) written that relies on the order of storage, the result will be that the program behaves inconsistently between different machines. Practically, this could happen if the unordered set is part of the implementation of a function/method that returns a list of values. The client of that function may not realize that an unordered set is being used, and may not realize that the order of the returned list is not guaranteed to be consistent/portable.

Thus, unordered sets are a bit more unforgiving to the programmer than ordered sets. They introduce this additional mechanism for confusing code behavior, which can lead to time consuming/confusing bugs because they may not be reproducible between machines.
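One way to avoid leaking the storage order to callers is to sort before returning. This is a minimal sketch of that idea, not something the answer itself prescribes; to_sorted_vector is a hypothetical helper name.

```cpp
#include <algorithm>
#include <unordered_set>
#include <vector>

// Copies the contents of an unordered_set into a vector and sorts it,
// so the result no longer depends on the hash table's internal layout.
std::vector<int> to_sorted_vector(const std::unordered_set<int>& s) {
    std::vector<int> out(s.begin(), s.end());
    std::sort(out.begin(), out.end());
    return out;
}
```

The returned list is then the same on every machine and standard library, at the cost of an O(n log n) sort per call.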

Pfister answered 4/8, 2021 at 22:35 Comment(0)

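In addition to the order and performance, there is another reason to use set rather than unordered_set: set can be used directly to build a "set of tuples" when implementing complex data structures, because std::tuple provides operator<. unordered_set does not support this out of the box, because the standard library has no std::hash specialization for std::tuple; you have to supply your own hash function.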
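The difference in effort can be sketched as follows. Key, KeyHash and the two counting functions are hypothetical names for this illustration, and the hash combiner follows the same style as the VectorHash shown in an earlier answer; it is one possible choice, not a recommended one.

```cpp
#include <cstddef>
#include <functional>
#include <set>
#include <tuple>
#include <unordered_set>
#include <vector>

using Key = std::tuple<int, int>;

// Hypothetical hash for Key that mixes the hashes of both fields.
struct KeyHash {
    std::size_t operator()(const Key& k) const {
        std::size_t seed = std::hash<int>{}(std::get<0>(k));
        seed ^= std::hash<int>{}(std::get<1>(k)) + 0x9e3779b9
                + (seed << 6) + (seed >> 2);
        return seed;
    }
};

// Works as is: std::tuple already provides operator<.
std::size_t count_with_set(const std::vector<Key>& keys) {
    std::set<Key> s(keys.begin(), keys.end());
    return s.size();
}

// Needs the extra KeyHash template argument to compile.
std::size_t count_with_unordered_set(const std::vector<Key>& keys) {
    std::unordered_set<Key, KeyHash> s(keys.begin(), keys.end());
    return s.size();
}
```

Both functions report the same number of distinct keys; the only difference is that the unordered version needs the extra hash type.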

Opalopalesce answered 9/5, 2023 at 17:47 Comment(0)

With C++23 you should also consider std::flat_set, which is in practice a permanently sorted vector. It comes in handy in use cases where insert and erase are rare (e.g.: all elements are known at time of creation). It offers fast search time and a considerably reduced memory footprint. The Boost version is discussed here.
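Since compiler support for std::flat_set still varies, the sketch below shows only the underlying idea of a sorted vector with binary search, using long-standing standard algorithms. It is not std::flat_set itself, and the names make_sorted_unique and sorted_contains are made up for this example.

```cpp
#include <algorithm>
#include <vector>

// Builds a sorted, duplicate-free vector: one contiguous allocation
// instead of one node per element.
std::vector<int> make_sorted_unique(std::vector<int> v) {
    std::sort(v.begin(), v.end());
    v.erase(std::unique(v.begin(), v.end()), v.end());
    return v;
}

// O(log n) membership test; the vector must already be sorted.
bool sorted_contains(const std::vector<int>& sorted, int x) {
    return std::binary_search(sorted.begin(), sorted.end(), x);
}
```

Lookups stay logarithmic and iteration is in order, but inserting into the middle of the vector later costs O(n), which is why this layout suits sets that rarely change.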

Mayers answered 5/12, 2023 at 15:38 Comment(0)
