How does C++ STL unordered_map resolve collisions?
Asked Answered
E

2

74

How does C++ STL unordered_map resolve collisions?

Looking at the http://www.cplusplus.com/reference/unordered_map/unordered_map/, it says "Unique keys No two elements in the container can have equivalent keys."

That should mean that the container is indeed resolving collisions. However, that page does not tell me how it is doing it. I know some ways to resolve collisions like using linked lists and/or probing. What I want to know is how the c++ STL unordered_map is resolving it.

Emeric answered 3/2, 2014 at 2:1 Comment(12)
It's implementation dependent. A language reference won't tell you how it's done, because it is not specified in the standard how it should be done.Coston
libstdc++ uses linear chaining, but other implementations of the STL may use other techniquesNide
@BrianBi: So does the current libc++. I wonder if the standard even permits open addressing...Dessalines
We are using different implementations of STL? I thought STL is "the" library every C++ programs use when using the STL.Emeric
Technically, we're not using the STL at all. We're using the C++ Standard Library, which is an interface specification. And yes, it has multiple implementations.Coston
for what it's worth, even the hashing algorithm or the implementation of std::hash is implementation-defined, if you are trying to get a specific answer to this, there is very little that can be said about it according to the standard.Macaluso
Ah I get it. So if libstdc++ is an implementation of the STL and the STL says "No two elements in the container can have equivalent keys.", how come it uses linear chaining to resolve collisions? Doesn't linear chaining puts the elements with the same hash key in the same hash key place?Emeric
No, the STL was a library decades ago which had ideas stolen and incorperated into the C++ standard library. Each compliler of C++ must implement the standard library through whatever means it chooses. People tend to call the parts of the standard library inspired by the STL the STL, albeit incorrectly.Redouble
"No two elements in the container can have equivalent keys" is a condition on Pred parameter of unordered_map. The notion of collisions applies to Hash parameter. Two keys may not be equivalent but may still hash to the same value - the very definition of hash collision.Hostelry
@Igor Ah right, I got confused with the key and the hash value. Thanks for clearing it up!Emeric
The question uses the word Keys, but it looks like you're really asking about Hash Codes, which is what the answer below describes. Please clarify the question text.Darladarlan
@Emeric Collisions are recommeded to be unlikely to cause collisions to avoid different elements ending up in a same bucket, because a later search will imply a longer linear search. A hash collision from different keys won't make a insertion fail because "an element already exists", for instance. Besides, two different keys with different hashes can still cause bucket collision if both hashes are multiple of bucket_count. That's why some implementations (like gcc) use a prime number as bucket_count, to decrease the probability of modular arithmetic causing a collision.Helterskelter
H
94

The standard defines a little more about this than most people seem to realize.

Specifically, the standard requires (§23.2.5/9):

The elements of an unordered associative container are organized into buckets. Keys with the same hash code appear in the same bucket.

The interface includes a bucket_count that runs in constant time. (table 103). It also includes a bucket_size that has to run in time linear on the size of the bucket.

That's basically describing an implementation that uses collision chaining. When you do use collision chaining, meeting all the requirements is somewhere between easy and trivial. bucket_count() is the number of elements in your array. bucket_size() is the number of elements in the collision chain. Getting them in constant and linear time respectively is simple and straightforward.

By contrast, if you use something like linear probing or double hashing, those requirements become all but impossible to meet. Specifically, all the items that hashed to a specific value need to land in the same bucket, and you need to be able to count those buckets in constant time.

But, if you use something like linear probing or double hashing, finding all the items that hashed to the same value means you need to hash the value, then walk through the "chain" of non-empty items in your table to find how many of those hashed to the same value. That's not linear on the number of items that hashed to the same value though--it's linear on the number of items that hashed to the same or a colliding value.

With enough extra work and a fair amount of stretching the meaning of some of the requirements almost to the breaking point, it might be barely possible to create a hash table using something other than collision chaining, and still at least sort of meet the requirements--but I'm not really certain it's possible, and it would certain involve quite a lot of extra work.

Summary: all practical implementations of std::unordered_set (or unordered_map) undoubtedly use collision chaining. While it might be (just barely) possible to meet the requirements using linear probing or double hashing, such an implementation seems to lose a great deal and gain nearly nothing in return.

Housecarl answered 3/2, 2014 at 4:15 Comment(11)
"It's still possible to meet the requirement [using linear probing or double hashing]" / "Then you walk through the "chain" of occupied items to find a slot where you can insert this item." isn't compatible with the "Keys with the same hash code appear in the same bucket." requirement. Of course, it's debatable if collision chaining puts anything "in the same bucket" anyway, but you're clearly proposing putting keys with the same hash in different buckets. 23.2.5/9 seems an unfortunate restriction... probing is useful for more than "load factor low" - many benefits from avoiding heap....Cataclysm
@TonyD: not really--in this case, a "bucket" is no longer a physical thing, but simply all the slots that hashed to the same value. You'd have to build enough intelligence into local_iterator to skip across any colliding items, but it still looks entirely possible to me.Housecarl
If it's that flexible, "Keys with the same hash code appear in the same bucket." has no meaning at all. Agree to disagree etc.. Cheers.Cataclysm
@TonyD: Not really. A bucket is the set of items with keys that hashed to a particular value, and which can be iterated with a local_iterator. Given that they're specifically avoiding specifying an implementation, it really shouldn't mean much about the implementation.Housecarl
Well, finding a mutually-agreed authoritative definition of "bucket" from e.g. Dijkstra isn't easy with Google, so let's just consider whether your postulated implementation can meet other requirements. Per your rationale for needing count, b.find(k) can't be worst-case O(b.size()) where b is a "logical" bucket size(). Further, b.begin(n), b.end(n) et al are required to have constant complexity, so would need to be stored alongside your count at the originally hashed-to bucket.Cataclysm
@JerryCoffin great explanation thank you - I'm curious though what you mean by being "linear on the number of items that hashed to the same or a colliding value". Colliding values, by definition, are ones that hash to the same value, right? So what does "or a colliding value" mean?Upturn
@rb612: Hashing is usually done in two pieces. You start by computing some large, fixed-sized value (e.g., 64 bits). Then you reduce that to the range of indices for your hash table. As I was using the terms, "hashing to the same value" would mean the original hashes were equal, and colliding referred to the reduced range one being equal.Housecarl
I perceive another heavy implication about implementation: Chaining seems inextricably related (by cause or consequence, I don't know; I haven't read any original rationale) with the Standard's guarantee that references stay valid forever, doesn't it? By my understanding, with open addressing, erasing would mean keeping around 'dummy' elements to avoid shifting others (rather than only optionally optimising through use of this like some open addressing maps), and insertion is precluded by not being able to put elements at the right place (at least without rebuilding or over-reserving buckets).Nonconformance
@underscore_d: Yeah, that would get...ugly as well. Haven't thought through carefully enough to be sure it's truly impossible, but even at best it would be ugly.Housecarl
Yeah, it might be possible to guarantee reference validity under open addressing, for highly specific contents and/or spans of time. However, it all seems highly contrary to ever being a general-purpose container, so it naturally follows that we must choose one or the other. I see someone else came to the same conclusion as me - and doesn't like it! - here, although I'm not so critical given that my current use for std::unordered_map uses references to avoid a 2nd round of lookups. Then again, if lookup were faster, I wouldn't need that... hahaNonconformance
Getting back to how/why the Standard does what it does, this answer quotes the original 2003 proposal, which did consider and ultimately discount open addressing due to technical complexities and a lack of proven implementation experience. I guess things might be different today, at least judging by the number and benchmarks of open addressing implementations I've been reading about.Nonconformance
W
5

I found this answer looking for how to detect when my types are colliding, so I will post this in case that is the intent of the question.:

I believe there's some misconception about "Unique keys No two elements in the container can have equivalent keys."

look at the code below

//pseudocode
std::unordered_map<int, char> hashmap;
hashmap[5] = 'a';
hashmap[5] = 'b'; //replace 'a' with 'b', there is no collision being handled.

I think the Jerry's answer is referring to the internal system that it uses to shrink keys to appropriate array indices.

If you want collisions to be handled for your types (with buckets), you need std::unordered_multimap and will have to iterate over

Hopefully this code can be read without the context I generated it with. it basically checks to see if any element in the bucket associated with the hash is the element I'm looking for.

//sp is std::shared_ptr
//memo is std::unordered_multimap< int, sp<AStarNode> >

//there's probably multiple issues with this code in terms of good design (like using int keys rather than unsigned)

bool AStar_Incremental::hasNodeBeenVisited(sp<AStarNode> node)
{
    using UMIter = std::unordered_multimap<int, sp<AStarNode> >::iterator;

    bool bAlreadyVisited = false;

    //get all values for key in O(1*)
    int hash = WorldGrid::hashGrid(node->location);
    std::pair<UMIter, UMIter> start_end = memo.equal_range(hash); //bucket range
    UMIter start = start_end.first;
    UMIter end = start_end.second;

    //hopefully this is implemented to be O(m) where m is the bucket size.
    for(UMIter bucketIter = start; bucketIter != end; ++bucketIter)
    {
        sp<AStarNode> previousNode = bucketIter->second;
        sf::Vector2i& previousVisit = previousNode->location;
        if (previousVisit == node->location)
        {
            bAlreadyVisited = true;
            break;
        }
    }

    return bAlreadyVisited;
}
Wilscam answered 15/3, 2019 at 12:41 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.