Efficient re-hashing of a rope
Asked Answered
V

1

6

Given a rope, let's say we need to know its hash (by passing the concatenation of all leaves through some hash function).

Now, when one rope leaf changes, what's an efficient way to recalculate the hash of the entire rope again? I.e. something like O(log n) instead of O(n).

One way would be to use a Merkle tree. However, this results in issues such as...

  • Empty non-leaf nodes, or leaf-nodes with zero-length substrings, affect the hash even though they have no effect on the effective rope contents;
  • Moving nodes from the right of a subtree to the left of that subtree's right-side sibling affects the final hash, but not the effective rope contents.

Is there a better algorithm for this? The hash function needn't be cryptographically secure, just good enough to avoid likely collisions.

Vulpine answered 8/2, 2017 at 9:58 Comment(0)
L
3

Just as any node of a rope stores the size of the left subtree (or itself if it is a leaf), any node can additionally store the polynomial hash of the string corresponding to the left subtree (or itself if it is a leaf).

When weight is recalculated for a node, hash is also recalculated for that node, with the same asymptotic complexity.

For example, let the nodes and the values in them be:

    left     right    string     weight
1:                     abcd         4
2:    1        4                    4
3:                     ef           2
4:    3        5                    2
5:                     ghi          3

The polynomial hash is, with some fixed constants p and q:

h (s[0] s[1] ... s[n-1]) = (s[0] * p^(n-1) + s[1] * p^(n-2) + ... + s[n-1] * p^0) mod q.

So, we have the following hashes stored, all modulo q:

         hash
1:  a*p^3 + b*p^2 + c*p^1 + d*p^0
2:  a*p^3 + b*p^2 + c*p^1 + d*p^0
3:  e*p^1 + f*p^0
4:  e*p^1 + f*p^0
5:  g*p^2 + h*p^1 + i*p^0

A note about calculation modulo q. Here and below, all additions and multiplications are carried out modulo q. In other words, we operate in the ring of integers modulo q. We use the fact that

(a ? b) mod q = ((a mod q) ? (b mod q)) mod q

for the ? operation being addition, subtraction and multiplication. Thus, every time we do one of these operations, we immediately append a mod q to keep the numbers small. For example, if p and q are less than 230 = 1,073,741,824, addition and subtraction can be done in a 32-bit integer type, and multiplication will be fine with an intermediate 64-bit integer type. After each multiplication, we immediately take the result modulo q, making it fit into 32-bit integer again.


Now, how do we get the hash of the root - for example, to make it the left child of some node, or just to get the hash of the whole string?

We go from the root and to the right, and we have to add weights and merge hashes. Turns out we can just do (remember that everything is modulo q):

({a*p^3 + b*p^2 + c*p^1 + d*p^0} * p^2 + {e*p^1 + f*p^0}) * p^3 + {g*p^2 + h*p^1 + i*p^0}

The values in curly brackets are the values stored in out nodes. We recurse to the right. When getting up, we remember the weight collected so far, multiply the left-side hash by p to the power of that weight (that's where p^3 and p^(3+2=5) come from), and add the accumulated right-side hash.

The resulting value is equal to just the hash of the whole string:

a*p^8 + b*p^7 + c*p^6 + d*p^5 + e*p^4 + f*p^3 + g*p^2 + h*p^1 + i*p^0


A few notes here.

  1. We have to precalculate, maybe lazily, the powers of p modulo q to be able to multiply by them fast.

  2. The whole construction may become more clear if we store the hash of the whole subtree, not only of the left subtree, in a node. However, this way, we are likely to lose that O(1) concatenation possibility that the rope structure has, making it down to the usual O(log n), so that we may just have used a regular treap instead of a rope. Even if not, caching the hash value of the whole subtree in a node is definitely a possibility.

  3. If we reverse the order of powers in the hashing polynomial, making it
    h (s[0] s[1] ... s[n-1]) = (s[0] * p^0 + s[1] * p^1 + ... + s[n-1] * p^(n-1)) mod q,
    the math is similar, but collecting the hash from all the right descendants of a node can be done iteratively instead of recursively.

Laconic answered 8/2, 2017 at 12:7 Comment(7)
Thanks for the quick and detailed answer. However, I don't understand how this would work considering the hash stored in each node is modulo q. I.e. each value in curly blackets is modulo q, so {ap^3 + bp^2 + cp^1 + dp^0} * p^2 + {ep^1 + fp^0} may not be necessarily equal to ({ap^3 + bp^2 + cp^1 + dp^0} mod q) * p^2 + ({ep^1 + fp^0} mod q), unless q is very large.Vulpine
Err, to clarify, {ap^3 + bp^2 + cp^1 + dp^0} * p^2 + {ep^1 + fp^0} == {ap^1 + bp^0} * p^4 + {cp^3 + dp^2 + ep^1 + fp^0} but not if you apply modulo q to the values in curly brackets.Vulpine
@VladimirPanteleev Every addition and multiplication is meant to be modulo q, that is, we operate in Z/qZ instead of all integers. Sorry, I'll edit to make it more clear.Laconic
Awesome, thanks! Here is my implementation of just the hashing (using integer overflow as q). I'm not versed in number theory, any hints for choosing a good p? My limited research led me that the best p would be a primitive root modulo q, but that's as far as I got.Vulpine
@VladimirPanteleev First, q being a power of two has a certain performance appeal to it, but it performs poorly with Thue-Morse string: ru, en. If your application can naturally encounter such strings (text editor? library?), it would be better to pick a large prime for q. Second, p should be larger than the size of the alphabet (one byte?) and relatively prime with q, but apart from that, most p are fine.Laconic
@VladimirPanteleev The largest order of an element modulo 2^k for k>1 is 2^{k-2}, so your order 2^30 for 2^32 is the best there is. Actually, for sufficiently large q, there are many elements of the largest order in Z/qZ, so one can be found if you just code how to find the order of an element, and then try maybe a few hundred random elements and take the maximum.Laconic
@VladimirPanteleev Lastly, for comparison, here is a discussion on polynomial hashes for use in competitive programming (sorry, Russian only). Your use case is perhaps different, but some of the points may be relevant.Laconic

© 2022 - 2024 — McMap. All rights reserved.