[SOLVED] Efficient re-hashing of a rope

Efficient re-hashing of a rope

Given a rope, let's say we need to know its hash (by passing the concatenation of all leaves through some hash function).

Now, when one rope leaf changes, what's an efficient way to recalculate the hash of the entire rope again? I.e. something like O(log n) instead of O(n).

One way would be to use a Merkle tree. However, this results in issues such as...

Empty non-leaf nodes, or leaf-nodes with zero-length substrings, affect the hash even though they have no effect on the effective rope contents;
Moving nodes from the right of a subtree to the left of that subtree's right-side sibling affects the final hash, but not the effective rope contents.

Is there a better algorithm for this? The hash function needn't be cryptographically secure, just good enough to avoid likely collisions.

Solution

Just as any node of a rope stores the size of the left subtree (or itself if it is a leaf), any node can additionally store the polynomial hash of the string corresponding to the left subtree (or itself if it is a leaf).

When weight is recalculated for a node, hash is also recalculated for that node, with the same asymptotic complexity.

For example, let the nodes and the values in them be:

    left     right    string     weight
1:                     abcd         4
2:    1        4                    4
3:                     ef           2
4:    3        5                    2
5:                     ghi          3

The polynomial hash is, with some fixed constants p and q:

h (s[0] s[1] ... s[n-1]) = (s[0] * p^(n-1) + s[1] * p^(n-2) + ... + s[n-1] * p^0) mod q.

So, we have the following hashes stored, all modulo q:

         hash
1:  a*p^3 + b*p^2 + c*p^1 + d*p^0
2:  a*p^3 + b*p^2 + c*p^1 + d*p^0
3:  e*p^1 + f*p^0
4:  e*p^1 + f*p^0
5:  g*p^2 + h*p^1 + i*p^0

A note about calculation modulo q. Here and below, all additions and multiplications are carried out modulo q. In other words, we operate in the ring of integers modulo q. We use the fact that

(a ? b) mod q = ((a mod q) ? (b mod q)) mod q

for the ? operation being addition, subtraction and multiplication. Thus, every time we do one of these operations, we immediately append a mod q to keep the numbers small. For example, if p and q are less than 2³⁰ = 1,073,741,824, addition and subtraction can be done in a 32-bit integer type, and multiplication will be fine with an intermediate 64-bit integer type. After each multiplication, we immediately take the result modulo q, making it fit into 32-bit integer again.

Now, how do we get the hash of the root - for example, to make it the left child of some node, or just to get the hash of the whole string?

We go from the root and to the right, and we have to add weights and merge hashes. Turns out we can just do (remember that everything is modulo q):

({a*p^3 + b*p^2 + c*p^1 + d*p^0} * p^2 + {e*p^1 + f*p^0}) * p^3 + {g*p^2 + h*p^1 + i*p^0}

The values in curly brackets are the values stored in out nodes. We recurse to the right. When getting up, we remember the weight collected so far, multiply the left-side hash by p to the power of that weight (that's where p^3 and p^(3+2=5) come from), and add the accumulated right-side hash.

The resulting value is equal to just the hash of the whole string:

a*p^8 + b*p^7 + c*p^6 + d*p^5 + e*p^4 + f*p^3 + g*p^2 + h*p^1 + i*p^0

A few notes here.

We have to precalculate, maybe lazily, the powers of p modulo q to be able to multiply by them fast.
The whole construction may become more clear if we store the hash of the whole subtree, not only of the left subtree, in a node. However, this way, we are likely to lose that O(1) concatenation possibility that the rope structure has, making it down to the usual O(log n), so that we may just have used a regular treap instead of a rope. Even if not, caching the hash value of the whole subtree in a node is definitely a possibility.
If we reverse the order of powers in the hashing polynomial, making it
h (s[0] s[1] ... s[n-1]) = (s[0] * p^0 + s[1] * p^1 + ... + s[n-1] * p^(n-1)) mod q,
the math is similar, but collecting the hash from all the right descendants of a node can be done iteratively instead of recursively.