Why is std::map implemented as a red-black tree?

There are several balanced binary search trees (BSTs) out there. What were design trade-offs in choosing a red-black tree?

Collimator answered 13/3, 2011 at 8:33 Comment(9)
Although all implementations I've seen use an RB-tree, note that this is still implementation-dependent.Galyak
@Thomas: it is implementation-dependent, so why do all implementations use RB-trees?Collimator
I'd really like to know if any STL implementer has thought about using a Skip List.Abernon
C++'s map and set are actually ordered map and ordered set. They are not implemented using hash functions. Every query takes O(log n) rather than O(1), but the values are always sorted. Starting from C++11 (I think), there are unordered_map and unordered_set, which are implemented using hash functions; while they are not sorted, most queries and operations are possible in O(1) on average.Amarelle
@Galyak that is true, but not that interesting in practice. The standard makes complexity guarantees with a specific algorithm, or set of algorithms in mind.Alberik
@JustinMeiners There are several other types of self-balancing trees with the same big-O complexity as a red-black tree, such as the B-tree and AVL tree.Galyak
A little late, but does Clang use an RB tree as well? Where can I find its source online?Showroom
I'm surprised that nobody has said anything about iterator invalidation. The STL's API guarantees that, when you insert or delete an element from a std::map, iterators pointing to other elements are not invalidated. This makes it very difficult, if not outright impossible, to store more than one element per dynamically allocated node while also fulfilling the usual time complexity guarantees. (Queries and updates to a std::map must take at worst logarithmic time.) So, in practice, std::map implementations have to be self-balancing binary trees of some sort.Spirochaetosis
@MatthieuM. Your idea is great. Well-optimized skip lists (if we deploy nice unrolling and other memory-management tricks) are much faster and are comparable with RBTs in memory use. Traversal is also better than in an RBT, where we have to go down and back up. I think it is more of a historical problem, since skip lists came really late (1990s); by that time the C++ language had already been standardized.Uncloak

Probably the two most common self-balancing tree algorithms are Red-Black trees and AVL trees. To balance the tree after an insertion/update, both algorithms use the notion of rotations, where the nodes of the tree are rotated to perform the re-balancing.

While in both algorithms the insert/delete operations are O(log n), in the case of a Red-Black tree the re-balancing rotation is an O(1) operation, while with AVL it is an O(log n) operation, making the Red-Black tree more efficient in this aspect of the re-balancing stage, and one possible reason it is more commonly used.

Red-Black trees are used in most collection libraries, including the offerings from Java and the Microsoft .NET Framework.

Boice answered 13/3, 2011 at 8:47 Comment(21)
You make it sound like red-black trees can do tree modifications in O(1) time, which is not true. Tree modifications are O(log n) for both red-black and AVL trees. That makes it moot whether the balancing part of the modification is O(1) or O(log n), because the main operation is already O(log n). Moreover, the slight extra work that AVL trees do results in a more tightly balanced tree, which leads to slightly faster lookups. So it is a perfectly valid trade-off and does not make AVL trees inferior to red-black trees.Overcoat
@agks mehx, I am referring to the rotation part of the algorithm. I will update that to be clearer.Boice
You have to look beyond the complexity to actual runtime to see a difference - AVL trees generally have a lower total runtime when there are many more lookups than inserts/deletes. RB trees have a lower total runtime when there are many more inserts/deletes. The exact proportion at which the break occurs depends of course on many details of implementation, hardware, and exact usage, but since library authors have to support a wide range of usage patterns, they have to take an educated guess. AVL is also slightly harder to implement, so you might want a proven benefit to use it.Edgardoedge
@Steve So, why was the decision made to use an RB tree as the default implementation? Do you mean that there are more inserts/deletes than lookups in general?Collimator
RB tree isn't a "default implementation". Each implementer chooses an implementation. As far as we know, they've all chosen RB trees, so presumably this is either for performance or for ease of implementation/maintenance. As I said, the breakpoint for performance might not imply that they think there are more inserts/deletes than lookups, just that the ratio between the two is above the level where they think RB probably beats AVL.Edgardoedge
@Steve. I'd like to mark your comment as an answer if it were supported by numbers.Collimator
@Denis: unfortunately the only way to get numbers is to make a list of std::map implementations, track down the developers, and ask them what criteria they used to make the decision, so this remains speculation.Edgardoedge
@ChrisTaylor Don't AVL trees and RB-trees take amortized O(1) to re-balance after an insertion?Selfless
@Selfless - There is not much space to go into the details. Maybe the following link will help pages.cs.wisc.edu/~ealexand/cs367/NOTES/AVL-Trees/…Boice
@ChrisTaylor I think that link might be misleading. On average, AVL trees also cost O(1) to restructure. See en.wikipedia.org/wiki/AVL_tree#Comparison_to_other_structures The intuition is that RB-trees are basically 2-3 trees and, loosely speaking, AVL-tree height changes don't propagate up until Ω(n) nodes have been modified.Selfless
While I would be prone to write some code to test all of this, it would take a while to write the code for both and prove them correct. I found the following practical analysis, which looks reasonably sound: nathanbelue.blogspot.com/2013/01/…Boice
@ChrisTaylor Once a node's position has been found, AVL Trees need up to 2 rotations to fix up the heights. This is because an insertion can increase the height of the sub-tree by at most 1. This increase can be [1] in the smaller tree (no rotation needed) or [2] in the larger tree (possibly no increase in height post-rotation) or [3] if both trees are of the same height, then insertion might effectively INCREASE the overall height of the tree by 1. This increase is compensated by a rotation at a higher level node. i.e. Case [3] leads to case [1] or [2] which are fixed up by a single rotation.Selfless
@ChrisTaylor That blog is a great resource about the practical performance one might expect - thanks! :)Selfless
@ChrisTaylor Maximum 2 rotations are performed for an insertion into an AVL tree as opposed to log n rotations. see problem 3d here ocw.mit.edu/courses/electrical-engineering-and-computer-science/…Gnome
@NikunjBanka, that is the maximum number of rotations, but it is an O(log n) operation in terms of cost. To be more accurate, it is the amortized cost, while RB-trees have a constant amortized cost, i.e. O(1), for post-insert rotations.Boice
@ChrisTaylor That's a very cool observation! Indeed, in a red-black tree a maximum of 3 operations can be performed, while in an AVL tree up to O(log N) operations may be performed, since we change the balance factors of log N nodes in the tree.Gnome
Please read the comments, and could you please correct your answer? You are totally wrong, and your answer just leads people to false affirmations. Most implementations use Red-Black trees because, in the general case where you have more manipulation (writes) than consultation (reads), the red-black tree is slightly faster, since it requires re-balancing less often. But when consultation (read) occurs more often than manipulation, AVL is faster because it is better balanced. Re-balancing is of the same order for both, which is not O(1) but O(log n).Balmy
@EricOuellet - I think you might have misread; I am not saying the rebalance is O(1), it is the rotation step that is O(1) for an R-B tree (at most a double left or right rotation required).Boice
Any rotation is always performed in O(1) in any tree rebalancing, so there is no point in mentioning that; it brings nothing to the discussion. Any re-balance requires between O(1) and O(log n) recursion in order to keep the tree rules valid. You can read the excellent sample cs.auckland.ac.nz/software/AlgAnim/red_black_op.html to see what I'm talking about. As in an AVL tree, you sometimes have to recurse to the top and re-order branches in order to keep the RB rules valid. The difference lies in the coloring: re-balancing occurs about half as often as in AVL.Balmy
Missing from all this is the per-node cost of storing the auxiliary information required to make balance decisions. Red-Black trees require one bit to represent the colour. AVL trees require at least two bits (to represent -1, 0 or 1).Fen
@EricOuellet: It is not true that re-balancing occurs 2x less often than in an AVL tree. For an RB tree, inserts cost a maximum of 2 rotations and a maximum of 3 * (log(n) / 2) recolourations. AVL tree inserts may cost log(n) rotations right up to the root. For an RB tree, deletes cost a maximum of 3 rotations and up to log(n) recolourations. AVL tree deletes cost the same as inserts. Rotations in RB trees are O(1) because they are bounded, due to the log(n) recolourations, whereas rotations in AVL trees are O(log(n)), there being no recolouration.Fen

It really depends on the usage. An AVL tree usually requires more rebalancing rotations. So if your application doesn't have many insertion and deletion operations, but weighs heavily on searching, then an AVL tree is probably a good choice.

std::map uses a Red-Black tree because it offers a reasonable trade-off between the speed of node insertion/deletion and the speed of searching.

Negron answered 26/5, 2012 at 18:32 Comment(3)
Are you sure about that? I personally think the Red-Black tree is at least as complex, never simpler. The only thing is that in a Red-Black tree, re-balancing occurs less often than in AVL.Balmy
@Eric Theoretically, both R/B trees and AVL trees have O(log n) complexity for insertion and deletion. But a big part of the operation cost is rotation, which differs between these two trees. Please refer to discuss.fogcreek.com/joelonsoftware/… Quote: "balancing an AVL tree can require O(log n) rotations, whilst a red black tree will take at most two rotations to bring it into balance (though it may have to examine O(log n) nodes to decide where the rotations are necessary)." Edited my comments accordingly.Negron
Thanks a lot for bringing my attention to the maximum of 2 rotations for insertion into an RB tree. You are right, I hadn't realized that. As you said, re-coloration happens in log(n) but costs a lot less than rotation. I think your answer is great; I don't remember what the previous one was ;-). Thanks!!!Balmy

The previous answers only address tree alternatives, and red-black trees probably remain only for historical reasons.

Why not a hash table?

A type only requires the < operator (comparison) to be used as a key in a tree. However, hash tables require that each key type have a hash function defined. Keeping type requirements to a minimum is very important for generic programming, so the container can be used with a wide variety of types and algorithms.

Designing a good hash table requires intimate knowledge of the context in which it will be used. Should it use open addressing or linked chaining? What load factor should it accept before resizing? Should it use an expensive hash that avoids collisions, or one that is rough and fast?

Since the STL can't anticipate which is the best choice for your application, the default needs to be more flexible. Trees "just work" and scale nicely.

(C++11 did add hash tables with unordered_map. You can see from the documentation that it requires setting policies to configure many of these options.)

What about other trees?

Red-Black trees offer fast lookup and are self-balancing, unlike plain BSTs. Another user pointed out their advantages over the self-balancing AVL tree.

Alexander Stepanov (the creator of the STL) said that he would use a B* tree instead of a Red-Black tree if he wrote std::map again, because it is friendlier to modern memory caches.

One of the biggest changes since then has been the growth of caches. Cache misses are very costly, so locality of reference is much more important now. Node-based data structures, which have low locality of reference, make much less sense. If I were designing STL today, I would have a different set of containers. For example, an in-memory B*-tree is a far better choice than a red-black tree for implementing an associative container. - Alexander Stepanov

Should maps always use trees?

Another possible map implementation would be a sorted vector (insertion sort) plus binary search. This works well for containers that aren't modified often but are queried frequently. I often do this in C, as qsort and bsearch are built in.

Do I even need to use map?

Cache considerations mean it rarely makes sense to use std::list or std::deque over std::vector, even in those situations we were taught about in school (such as removing an element from the middle of the list). Applying the same reasoning, using a for loop to linearly search a list is often more efficient, and cleaner, than building a map for a few lookups.

Of course choosing a readable container is usually more important than performance.

Alberik answered 22/12, 2017 at 0:46 Comment(0)

AVL trees have a maximum height of 1.44 log n, while RB trees have a maximum of 2 log n. Inserting an element into an AVL tree may imply a rebalance at one point in the tree. The rebalancing finishes the insertion. After insertion of a new leaf, the ancestors of that leaf have to be updated up to the root, or up to a point where the two subtrees are of equal depth. The probability of having to update k nodes is 1/3^k. Rebalancing is O(1). Removing an element may imply more than one rebalancing (up to half the depth of the tree).

RB-trees are B-trees of order 4 represented as binary search trees. A 4-node in the B-tree results in two levels in the equivalent BST. In the worst case, all the nodes of the tree are 2-nodes, with only one chain of 3-nodes down to a leaf. That leaf will be at a distance of 2logn from the root.

Going down from the root to the insertion point, one has to change 4-nodes into 2-nodes, to make sure any insertion will not saturate a leaf. Coming back from the insertion, all these nodes have to be analysed to make sure they correctly represent 4-nodes. This can also be done going down in the tree. The global cost will be the same. There is no free lunch! Removing an element from the tree is of the same order.

All these trees require that nodes carry information on height, weight, color, etc. Only splay trees are free from such additional info. But most people are afraid of splay trees, because of the randomness of their structure!

Finally, trees can also carry weight information in the nodes, permitting weight balancing. Various schemes can be applied. One should rebalance when a subtree contains more than 3 times the number of elements of the other subtree. Rebalancing is again done through either a single or a double rotation. This means a worst case of 2.4 log n. One can get away with 2 times instead of 3, a much better ratio, but it may mean leaving a little less than 1% of the subtrees unbalanced here and there. Tricky!

Which type of tree is the best? AVL, for sure. They are the simplest to code, and have their worst-case height closest to log n. For a tree of 1,000,000 elements, an AVL will be at most of height 29, an RB 40, and a weight-balanced 36 or 50 depending on the ratio.

There are a lot of other variables: randomness, ratio of adds, deletes, searches, etc.

Gal answered 16/7, 2011 at 1:52 Comment(6)
Good answer. But if AVLs are the best, why do standard libraries implement std::map as an RB tree?Collimator
I disagree that AVL trees are unquestionably the best. Although they have a low height, they require (in total) more work to do rebalancing than red/black trees (O(log n) rebalancing work versus O(1) amortized rebalancing work). Splay trees could be much, much better, and your assertion that people are afraid of them is unfounded. There is no one universal "best" tree balancing scheme out there.Hire
Almost perfect answer. But why did you say AVL is the best? That is simply wrong, and that's why most general implementations use the Red-Black tree. You need a pretty high ratio of reads over manipulation to choose AVL. Also, AVL has a slightly smaller memory footprint than RB.Balmy
I agree that AVL tends to be better in most cases, because usually trees are searched more often than they are inserted into. Why is the RB tree so widely considered to be better when it has only a slight advantage in the write-mostly case and, more importantly, a slight disadvantage in the read-mostly case? Is it really believed that you'll insert more than you'll find?Obloquy
Downvoted for saying AVL trees are best "for sure". One must consider # reads vs # writes to determine if AVL is preferred.Nephograph
Excuse me. Isn't the scapegoat tree the simplest self-balancing BST?Pinson

It is just the choice of your implementation - they could be implemented as any balanced tree. The various choices are all comparable, with minor differences. Therefore any one is as good as any other.

Overcoat answered 13/3, 2011 at 8:48 Comment(0)

Update 2017-06-14: webbertiger edited their answer after I commented. I should point out that the answer is now a lot better in my eyes. But I kept my answer as additional information...

Because I think the first answer is wrong (correction: not both anymore) and the third contains a wrong affirmation, I feel I have to clarify things...

The two most popular trees are AVL and Red-Black (RB). The main difference lies in the utilization:

  • AVL : Better if the ratio of consultation (read) to manipulation (modification) is high. Memory footprint is a little smaller than RB (due to the bit required for coloring).
  • RB : Better in general cases where there is a balance between consultation (read) and manipulation (modification), or more modification than consultation. Slightly bigger memory footprint due to storing the red-black flag.

The main difference comes from the coloring. You have fewer re-balance actions in an RB tree than in an AVL tree, because the coloring enables you to sometimes skip or shorten re-balance actions, which have a relatively high cost. Because of the coloring, RB trees also have more levels, since they can accept red nodes between black ones (potentially ~2x more levels), making search (read) a little less efficient... but because it is a constant (2x), it stays O(log n).

If you consider the performance hit for a modification of a tree (significant) versus the performance hit for consultation of a tree (almost insignificant), it becomes natural to prefer RB over AVL in the general case.

Balmy answered 30/5, 2017 at 20:33 Comment(0)
