How std::unordered_map is implemented

c++ unordered_map collision handling, resize and rehash

This follows a previous question I opened, and I have realised that I am quite confused about how unordered_map is implemented. I am sure many other people share that confusion. Here is what I believe without having read the standard:

Every unordered_map implementation stores a linked list to external nodes in the array of buckets... No, that is not at all the most efficient way to implement a hash map for most common uses. Unfortunately, a small "oversight" in the specification of unordered_map all but requires this behavior. The required behavior is that iterators to elements must stay valid when inserting or deleting other elements.

I was hoping that someone might explain the implementation, how it fits the C++ standard's definition (in terms of performance requirements), and, if it really is not the most efficient way to implement a hash map data structure, how it could be improved.

Headroom answered 29/6, 2015 at 10:4 Comment(8)
The standard doesn't dictate the implementation, but rather performance requirements. Thus, implementations of STL containers may differ from vendor to vendor. – Pentarchy
I don't understand why this is too broad or irrelevant, nor why it got two negative votes. It is a perfectly valid question from my point of view... – Headroom
I think it's because your point of view isn't what's used as a yardstick for which sort of questions are on-topic on this site. No-one can "explain the implementation" as there are many potential implementations. – Rexanna
Then someone might come and say that the standard implies this, this and this, and that two or three major implementations use a linked list, allocate memory and resize it when necessary, etc. No one is asking for the actual code of the implementation here... – Headroom
I think that the complexity and iterator requirements essentially require an implementation as a hash table with chaining. – Siemens
I can understand the 2nd and 4th, and I will modify it. Thanks, but I mean, how many questions on this site are asked after reading the standard? I basically quoted something I found on the web; I don't say that this guy knows the standard better than the committee or anything. – Headroom
@Alex I am not interested in any particular implementation either. What I was hoping to get is something like this: std::map is mostly implemented using a red-black tree with pruning; unordered_map is mostly implemented using .... and ... . That is it. – Headroom
@Walter Many similar questions should be closed as opinion-based; however, in this case, the restrictions on the implementation of the std unordered maps are strong enough that a good answer can be generated. – Regalado

The Standard effectively mandates that implementations of std::unordered_set and std::unordered_map - and their "multi" brethren - use open hashing aka separate chaining, which means an array of buckets, each of which holds the head of a linked list†. That requirement is subtle: it is a consequence of:

  • the default max_load_factor() being 1.0 (which means the table will resize whenever size() would otherwise exceed 1.0 times the bucket_count()), and
  • the guarantee that the table will not be rehashed unless grown beyond that load factor.

That would be impractical without chaining, as the collisions with the other main category of hash table implementation - closed hashing aka open addressing - become overwhelming as the load_factor() approaches 1.

References:

23.2.5/15: The insert and emplace members shall not affect the validity of iterators if (N+n) < z * B, where N is the number of elements in the container prior to the insert operation, n is the number of elements inserted, B is the container’s bucket count, and z is the container’s maximum load factor.

amongst the Effects of the constructor at 23.5.4.2/1: max_load_factor() returns 1.0.

† To allow optimal iteration without passing over any empty buckets, GCC's implementation fills the buckets with iterators into a single singly-linked list holding all the values: the iterators point to the element immediately before that bucket's elements, so the next pointer there can be rewired if erasing the bucket's last value.

Regarding the text you quote:

No, that is not at all the most efficient way to implement a hash map for most common uses. Unfortunately, a small "oversight" in the specification of unordered_map all but requires this behavior. The required behavior is that iterators to elements must stay valid when inserting or deleting other elements.

There is no "oversight"... what was done was very deliberate and done with full awareness. It's true that other compromises could have been struck, but the open hashing / chaining approach is a reasonable compromise for general use, that copes reasonably elegantly with collisions from mediocre hash functions, isn't too wasteful with small or large key/value types, and handles arbitrarily-many insert/erase pairs without gradually degrading performance the way many closed hashing implementations do.

As evidence of the awareness, from Matthew Austern's proposal here:

I'm not aware of any satisfactory implementation of open addressing in a generic framework. Open addressing presents a number of problems:

• It's necessary to distinguish between a vacant position and an occupied one.

• It's necessary either to restrict the hash table to types with a default constructor, and to construct every array element ahead of time, or else to maintain an array some of whose elements are objects and others of which are raw memory.

• Open addressing makes collision management difficult: if you're inserting an element whose hash code maps to an already-occupied location, you need a policy that tells you where to try next. This is a solved problem, but the best known solutions are complicated.

• Collision management is especially complicated when erasing elements is allowed. (See Knuth for a discussion.) A container class for the standard library ought to allow erasure.

• Collision management schemes for open addressing tend to assume a fixed size array that can hold up to N elements. A container class for the standard library ought to be able to grow as necessary when new elements are inserted, up to the limit of available memory.

Solving these problems could be an interesting research project, but, in the absence of implementation experience in the context of C++, it would be inappropriate to standardize an open-addressing container class.

Specifically for insert-only tables with data small enough to store directly in the buckets, a convenient sentinel value for unused buckets, and a good hash function, a closed hashing approach may be roughly an order of magnitude faster and use dramatically less memory, but that's not general purpose.

A full comparison and elaboration of hash table design options and their implications is off topic for S.O. as it's way too broad to address properly here.

Paulsen answered 29/6, 2015 at 10:41 Comment(9)
"... open hashing / chaining ... handles arbitrarily-many insert/erase pairs without gradually degrading performance the way many closed hashing implementations do." Actually, it does gradually degrade performance. What it avoids is the precipitous performance loss typical of most closed hashing. Either way, the performance degrades from O(1) to O(N), but with closed hashing that happens from roughly 90% full to 100% full, whereas with open hashing it's typically from (say) 100% full to 1000% full (i.e., you've inserted ten times as many items as there are slots in the table). – Dimitri
You can also do open hashing with a balanced tree instead of a list for each bucket. In this case, the degradation is from O(1) to O(log N). Even with 10x as many items as slots in the table, there's still only minimal performance degradation (with the proviso that it's generally somewhat slower, even when minimally utilized). – Dimitri
@JerryCoffin: true – I've heard that's what (one of?) the key Java implementations does, but it would adversely affect iteration over the full container compared to the singly linked list approach GCC uses (outlined in the answer). – Paulsen
@JerryCoffin: my reply above addressed your second comment, but I just noticed I never addressed your first. Regarding "Actually, it does gradually degrade... from (say) 100% full to 1000% full" – that doesn't happen, due to the max load factor: the table resizes at 100%. But you're not talking about the performance issue with closed hashing that I was alluding to either, which is that when you erase an element you either have to mark the bucket as having been in use and keep searching past it (a small lasting cost), or "compact" a collision chain into the bucket (a larger up-front cost). – Paulsen
@MohitShah Nonsense. It's an extremely good answer, which reasons clearly and – crucially – cites the original proposal. If you had trouble reading it, that's your problem, not the answer's. – Askari
If "the iterators point to the element immediately before that bucket's elements", how does the iterator returned by unordered_map::begin iterate forward through the unordered_map? – Felly
@palapapa: the unordered container must store a "head" iterator/pointer to the first element in the singly linked list: that pointer is the one to be rewired if the first element (at *begin()) is erased. It's not uncommon for linked list operations to be programmed so that little code has to be special-cased for this, by treating a pointer to the head much as a pointer to the "next" part of an element-embedding node. (I haven't taken another look at any Standard Library implementation whilst writing this comment, but it's pretty basic DS&A material.) – Paulsen
@TonyDelroy I don't think I am understanding your answer correctly. If every bucket contains an iterator to a linked list containing the elements in that bucket, how can the iterators also point to the elements before the bucket? By "before" did you mean the last element of the previous non-empty bucket, or just the previous non-empty bucket? Regardless of which is true, how does the iterator returned by begin iterate over the elements if the iterator in every bucket points to the previous bucket? And how does all of this help unordered_map skip empty buckets? – Felly
@palapapa: "last element of the previous non-empty bucket" – almost – rather, the last element of another (unspecified) non-empty bucket. So, the order in which buckets' elements appear in the list doesn't match the bucket order, but any given bucket's elements are together, linked by next pointers. "how does the iterator returned by begin iterate over the elements if the iterator in every bucket points to the previous bucket" – it just follows the next pointers through the linked list; buckets are irrelevant, which is how "all of this help[s] unordered_map skip empty buckets". – Paulsen

© 2022 - 2024 — McMap. All rights reserved.