consistent hashing vs. rendezvous (HRW) hashing - what are the tradeoffs?
Asked Answered
I

2

19

There is a lot available on the Net about consistent hashing, and implementations in several languages available. The Wikipedia entry for the topic references another algorithm with the same goals:

Rendezvous Hashing

This algorithm seems simpler, and doesn't need the addition of replicas/virtuals around the ring to deal with uneven loading issues. As the article mentions, it appears to run in O(n) which would be an issue for large n, but references a paper stating it can be structured to run in O(log n).

My question for people with experience in this area is, why would one choose consistent hashing over HRW, or the reverse? Are there use cases where one of these solutions is the better choice?

Many thanks.

Indication answered 26/12, 2013 at 20:33 Comment(0)
O
19

Primarily I would say the advantage of consistent hashing is when it comes to hotspots. Depending on the implementation its possible to manually modify the token ranges to deal with them.

With HRW if somehow you end up with hotspots (ie caused by poor hashing algorithm choices) there isn't much you can do about it short of removing the hotspot and adding a new one which should balance the requests out.

Big advantage to HRW is when you add or remove nodes you maintain an even distribution across everything. With consistent hashes they resolve this by giving each node 200 or so virtual nodes, which also makes it difficult to manually manage ranges.

Organicism answered 6/1, 2014 at 18:33 Comment(2)
Thanks, that's pretty much what I had come up with. Surprising there isn't more written about HRW usage out there.Indication
Well, HRW implies usage of weight, which pretty much deals with what I assume you mean by hotspots. You probably won't use ConsistentHashing without virtual nodes anyway, and an easy way to manage ranges there is to modify the number of virtual nodes depending on weigth/load so I really don't see this as the primary difference. From what I have gathered in reading and experience, the key distribution is more uniform in HRW (where CH needs many more keys to flatten out). I also see stable redistribution after node loss (only lost records need redistribution) as key advantages (pun intended) .Santosantonica
P
11

Speaking as someone who's just had to choose between the two approaches and who ultimately plumped for HRW hashing: My use case was a simple load balancing one with absolutely no reassignment requirement -- if a node died it's perfectly OK to just choose a new one and start again. No re balancing of existing data is required.

1) Consistent Hashing requires a persistent hashmap of the nodes and vnodes (or at least a sensible implementation does, you could build all the objects on every request.... but you really don't want to!). HWR does not (it's state-less). Nothing needs to be modified when machines join or leave the cluster - there is no concurrency to worry about (except that your clients have a good view of the state of the cluster which is the same in both cases)

2) HRW is easier to explain and understand (and the code is shorter). For example this is a complete HRW algorythm implemented in Riverbed Stingray TrafficScript. (Note there are better hash algorithms to choose than MD5 - it's overkill for this job)

$nodes = pool.listActiveNodes("stingray_test");

# Get the key
$key = http.getFormParam("param");

$biggest_hash = "";
$node_selected = "";

foreach ($node in $nodes) {
   $hash_comparator = string.hashMD5($node . '-' . $key);

   # If the combined hash is the biggest we've seen, we have a candidate
   if ( $hash_comparator > $biggest_hash ) {
      $biggest_hash = $hash_comparator;
      $node_selected = $node;
   }
}

connection.setPersistenceNode( $node_selected );
​

3) HRW provides an even distribution when you lose or gain nodes (assuming you chose a sensible hash function). Consistent Hashing doesn't guarantee that but with enough vnodes it's probably not going to be an issue

4) Consistent Routing may be faster - in normal operation it should be an order Log(N) where N is the number of nodes * the replica factor for vnodes. However, if you don't have a lot of nodes (I didn't) then HRW is going to be probably fast enough for you.

4.1) As you mentioned wikipedia mentions that there is a way to do HWR in log(N) time. I don't know how to do that! I'm happy with my O(N) time on 5 nodes.....

In the end, the simplicity and the stateless nature of HRW made the choice for me....

Prostrate answered 7/7, 2014 at 14:0 Comment(3)
You should probably sort the list of nodes in some way before doing the linear search. Otherwise the choice of node will be inconsistent if there is a hash collision. Very unlikely in your case with md5 (but still possible). If the selected hash function gets smaller (e.g. xxhash 32bit) then this becomes more likely.Riven
of course perhaps pool.listActiveNodes does that already but its not clear in your snippetRiven
It is also important to note that although O(N) the per-N cost can be much lower with HRW. The node hashes can be pre-computed and saved, using one hash algorithm, and the key hashes if using a different, quality algorithm, can simply be XOR'd with each node and compared. XOR is a safe mixing function if the two hash algorithms are distinct and of decent quality (e.g. nodes use SHA-2, keys use SipHash). With Consistent Hashing, you need a tree or binary search, on 200 * N, for log(200 * N) compares or so, and significantly worse locality of reference once N is larger.Engrossing

© 2022 - 2024 — McMap. All rights reserved.