distributed-system, crdt

CRDTs: Delta-state Add-Wins-Observed-Removed Map


I found a nice introductory article on delta-state CRDTs here.

Their approach uses a map with a root version vector and tags each key with a "dot" list. The version vector is standard: a map from replica ids to version integers. A dot represents a single edit, written as a pair (replica id, version), and each entry in the version vector summarizes a series of such dots. Several research papers I've read take a very similar approach.
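To make the terminology concrete, here is a minimal sketch of a version vector and the "covers" relation between a vector and a dot (the names `covers` and `merge` are my own, not from the article):

```python
from typing import Dict, Tuple

Dot = Tuple[str, int]           # (replica id, version)
VersionVector = Dict[str, int]  # replica id -> highest version seen

def covers(vv: VersionVector, dot: Dot) -> bool:
    """A version vector covers a dot if it has seen at least that version
    from that replica."""
    replica, version = dot
    return vv.get(replica, 0) >= version

def merge(a: VersionVector, b: VersionVector) -> VersionVector:
    """Pointwise maximum: the merged vector covers every dot that either
    input covered."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}
```

So `covers({"A": 3}, ("A", 2))` is true, while `covers({"A": 3}, ("B", 1))` is false.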

To calculate a diff, a replica sends its root version vector to a peer. The peer walks through each key in its map, examining each key's dot list. If any dot in that list is not "covered" by the incoming version vector, then the key and the values associated with those dots must be packaged and sent back.
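As I understand it, the diff walk looks roughly like this (a hypothetical sketch; the representation of a key's entries as `(dot, value)` pairs is my assumption, not the article's):

```python
from typing import Dict, List, Tuple

Dot = Tuple[str, int]  # (replica id, version)
Entries = List[Tuple[Dot, object]]

def compute_delta(state: Dict[str, Entries],
                  peer_vv: Dict[str, int]) -> Dict[str, Entries]:
    """Collect, per key, every (dot, value) pair the peer has not yet seen,
    i.e. whose dot is not covered by the peer's root version vector."""
    delta: Dict[str, Entries] = {}
    for key, entries in state.items():
        missing = [(dot, value) for dot, value in entries
                   if peer_vv.get(dot[0], 0) < dot[1]]
        if missing:
            delta[key] = missing
    return delta
```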

The CRDT map is composable, so a particular value could be implemented using an LWW (last-write-wins) register, an MVV register, a CRDT set, another CRDT map, etc.

There are three things that aren't clear to me.

  1. Why is a dot list used for each key instead of a single dot? The article mentions that multiple dots are present because of concurrent writes to the same key. Assume the entry is an LWW CRDT. From my understanding of the article, if two replicas concurrently write to that key, then after a sync each replica will end up with a dot list containing two dots (and the two values associated with them). When the map key is read, the value associated with the dot having the maximum timestamp is returned. Why wouldn't this be done from the get-go? In other words, why not discard the dot with the earlier timestamp and overwrite its value? The root version vectors are still merged, so we know the map has incorporated the earlier edit into its history. I don't see any reason this wouldn't work.
  2. Is the dot list ever shrunk? To me, it seems that the dot list for each key will eventually grow into a full version vector. My intuition would be that when a replica makes a local edit, it would overwrite the dot list with a single, new dot, but I'm not certain of this.
  3. At the end of the article, the author mentions that:

[In the] old Remove Wins Map the tombstone was a single dot. With Add Wins Maps this is more complex, and that is the subject of my next post!

It's not clear to me why anything special needs to be done to get add-wins behavior. In my mind, each element in the set would be tagged with a single dot. When a remove occurs, the item is tagged with a tombstone and a single dot. Because the version vectors and dots tell us when a concurrent update and remove occurred, we can simply discard the tombstone.
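For reference, the overwrite rule I'm imagining in question 1 would look something like this (hypothetical names; I'm assuming each dot carries a timestamp for the LWW comparison):

```python
from typing import List, Tuple

# (dot, timestamp, value), where dot = (replica id, version)
Entry = Tuple[Tuple[str, int], float, object]

def lww_merge(local: List[Entry], remote: List[Entry]) -> List[Entry]:
    """Keep only the entry with the largest timestamp; the merged root
    version vector already records the losing edit in the map's history.
    Ties are broken by replica id so every replica picks the same winner."""
    winner = max(local + remote, key=lambda e: (e[1], e[0][0]))
    return [winner]
```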

Thank you in advance for your help clearing up these questions.


Solution

  • I would recommend contacting the author for definite answers (the post is rather sparse), but here is my attempt at interpreting it:

    1. The only example they show with multiple dots is actually a counter, not an LWW register. The traditional state-based vector counter always stores the most recent dot from each replica.

      For the LWW register, they might need to store all dots to handle removals properly. This is definitely the case for Add-Wins removals [1], but I'm not sure for their Remove-Wins removals.

    2. I agree, it seems like they are using a causal-overwrite rule (a new dot overwrites causally prior dots). Otherwise, I would expect the root in each figure to contain all causal dots that appear in the tree.

    3. The traditional algorithm for Add-Wins removals is [2]:

      • Store the set of all events that have not been deleted, plus a global version vector. (As they are already doing.)
      • When you receive an event from a peer, only add it to your state if it's not subsumed by your global version vector. (Otherwise, you already removed it.)
      • When you receive a whole state from a peer, delete any events that are subsumed by the remote version vector but not present in the remote state. (The remote peer must have witnessed its removal.)

      I think this is what you are getting at; perhaps the author just considers it more complicated? Or, there could be complications involving the tree structure / diff procedure.
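      A minimal sketch of those three rules (my own names, and assuming causal delivery, so each replica's versions arrive in order):

```python
from typing import Dict, Set, Tuple

Dot = Tuple[str, int]  # (replica id, version)

class AddWinsSet:
    def __init__(self):
        self.dots: Set[Dot] = set()   # live (not-deleted) events
        self.vv: Dict[str, int] = {}  # global version vector

    def _covers(self, vv: Dict[str, int], dot: Dot) -> bool:
        return vv.get(dot[0], 0) >= dot[1]

    def receive_event(self, dot: Dot) -> None:
        # Only add the event if it's not subsumed by our version vector;
        # otherwise we have already seen (and possibly removed) it.
        if not self._covers(self.vv, dot):
            self.dots.add(dot)
            self.vv[dot[0]] = max(self.vv.get(dot[0], 0), dot[1])

    def receive_state(self, remote_dots: Set[Dot],
                      remote_vv: Dict[str, int]) -> None:
        # Drop any local event the remote has seen but no longer stores:
        # the remote must have witnessed its removal.
        self.dots = {d for d in self.dots
                     if d in remote_dots or not self._covers(remote_vv, d)}
        # Add remote events that are new to us.
        self.dots |= {d for d in remote_dots if not self._covers(self.vv, d)}
        for r, v in remote_vv.items():
            self.vv[r] = max(self.vv.get(r, 0), v)
```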

    [1] "Assessing the understandability of a distributed algorithm by tweeting buggy pseudocode", Martin Kleppmann, https://doi.org/10.48456/tr-969

    [2] "An optimized conflict-free replicated set", Annette Bieniusa et al., https://arxiv.org/abs/1210.3368