storagehash-collisionfuture-proofcamlistore

How do content addressable storage systems deal with possible hash collisions?


Content addressable storage systems use the hash of the stored data as the identifier and the address. Collisions are incredibly rare, but if the system is used a lot for a long time, it might happen. What happens if there are two pieces of data that produce the same hash? Is it inevitable that the most recently stored one wins and data is lost, or is it possible to devise ways to store both and allow accessing both?

To keep the question narrow, I'd like to focus on Camlistore. What happens if permanodes collide?


Solution

  • It is assumed that collisions do not happen. Which is a perfectly reasonable assumption, given a strong hash function and a casual, non-malicious user inputs. SHA-1, which is what Camlistore currently uses, is also resistant to malicious attempts to produce collision.

    In case a hash function becomes weak with time and needs to be retired, Camlistore supports a migration to a new hash function for new blobrefs, while keeping old blob refs accessible.

    If a collision did happen, as far as I understand, the first stored blobref with that hash would win.

    source: https://groups.google.com/forum/#!topic/camlistore/wUOnH61rkCE