cryptographyhash-collision

Likelihood of Collision


I want to hash an internal account number and use the result as a unique public identifier for an account record. The identifier is limited to 40 characters. I have approximately 250 records with unique account numbers.

What is less likely to result in a collision.

  1. Taking the SHA-1 of the SHA-256 hash of the account number.
  2. Taking the SHA-256 of the account number and picking out 40 characters.

Solution

  • These approaches are identical (*), so you should use the second one. There is no reason to inject SHA-1 into the system. Any selection of bits out of SHA-256 are independent and "effectively random."

    An alternate solution that may be convenient is to turn these into v5 UUIDs. If you keep your namespace secret (which is allowed), this may be a very nice way to do what you're describing.

    (*) There are some subtleties around the fact that your using "characters" rather than bytes here, and you could get a larger space in 40 "characters" by using a better encoding than you're likely using. It's possible the spaces are a little different based on how you're actually encoding. But it deeply doesn't matter. These spaces are enormous, and the two approaches will be the same in practice, so use the one that only needs one algorithm.


    Another approach that may meet your needs better is to stretch the identifiers. If the space is sufficiently sparse (i.e if the number of possible identifiers is dramatically larger than the number of actually used identifiers), then stretching algorithms like PBKDF2 are designed to handle exactly that. They are expensive to compute, but you can tune their cost to match your security requirements.

    The general problem with just hashing is that hashes are very fast, and if your space of possible identifiers is very small, then it's easy to brute force. Stretching algorithms make the cost of guessing arbitrarily expensive, so large spaces are impractical to brute force. They do this without requiring any secrets, which is nice. The general approach is:

    The number of iterations tunes how slow this operation is. The output is "effectively random" (just like a hash) and can be used in the same ways.

    A common target for iterations is a value that delivers around 80-100ms. This is fairly fast on the server, but is extremely slow for brute-forcing large spaces, even if the attacker has better hardware than yours. Ideally your space should take at least millions of years to brute force (seriously; this is the kind of headroom we typically like in security; I personally target trillions of years). If it's smaller than a few years, then it probably can be brute forced quickly by throwing more hardware at it.

    (Of course all of these choices can be turned based on your attack model. It depends on how dedicated and well-funded you expect attacks to be.)