Tags: javascript, hash, cryptography, sha, hash-collision

How to generate a string hash, with a custom alphabet and length, while minimizing collisions


Assume I need to generate the hash for a string where the hash itself can be max N characters long in a given alphabet, e.g. all alphanumeric characters [a-zA-Z0-9] plus the symbols !?-=.

One trivial approach would be to use a well-known hash algorithm such as SHA-1 and then truncate the output. Assuming N is 10 and the alphabet is a superset of the hex digits, here is a trivial solution in JavaScript:

var crypto = require('crypto')
var shasum = crypto.createHash('sha1')
shasum.update('foo')
var hash = shasum.digest('hex') // => "0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33"
var shortHash = hash.substr(0, 10) // => "0beec7b5ea"

While this respects the length and alphabet constraints, it is clearly not optimal, because it does not exploit the much larger output space that the full alphabet above would allow.
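
For concreteness, here is a rough back-of-the-envelope comparison of the two output spaces, assuming N is 10 and the 66-symbol alphabet described above:

// Rough size comparison of the two output spaces for a 10-character hash.
// Assumes the 66-symbol alphabet [a-zA-Z0-9!?-=] from above.
const hexSpace = Math.pow(16, 10)  // ≈ 1.1e12 possible values
const fullSpace = Math.pow(66, 10) // ≈ 1.6e18 possible values
console.log(fullSpace / hexSpace)  // ≈ 1.4e6, i.e. roughly a million times more room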

Moreover, does truncating a SHA-1 hash increase the collision probability only in proportion to the reduction of the output space, or by more than that (e.g. because of internal correlations between the bits)?
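
For intuition, a standard birthday-bound approximation shows how the collision risk scales with the size of the output space, assuming the kept characters are uniformly distributed (which is exactly the assumption in question):

// Birthday-bound sketch: the probability of at least one collision among k values
// drawn uniformly at random from a space of size N is roughly 1 - exp(-k*(k-1)/(2*N)).
// This assumes the truncated output is uniform, i.e. the open question above.
function collisionProbability(k, N) {
  return 1 - Math.exp((-k * (k - 1)) / (2 * N))
}

console.log(collisionProbability(1e6, Math.pow(16, 10))) // ≈ 0.37 for 10 hex characters
console.log(collisionProbability(1e6, Math.pow(66, 10))) // ≈ 3e-7 for 10 characters over 66 symbols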

Disclaimer: This is not intended for security-critical applications, and I am aware of the increased collision probability. The goal of the question is purely to understand whether there is a theoretically optimal way of achieving what is described above.


Solution

  • After some research, here is the solution I have landed on in Node, using SHA-256 and the base-x package.

    import crypto from "crypto";
    import basex from "base-x";
    
    const base62 = basex(
      "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    );
    
    const DEFAULT_LENGTH = 15;
    
    function shortHash(input: string, precision = DEFAULT_LENGTH) {
      return base62
        .encode(crypto.createHash("sha256").update(input).digest())
        .slice(0, precision);
    }
    

    How it works and assumptions

    1. First, the input is hashed with the Node crypto module. Here I use SHA-256, but other algorithms would work too.
    2. Then, the resulting buffer is encoded to base 62. The assumption here is that the desired alphabet is the 62 alphanumeric characters; for a different alphabet, base-x would be given a string with the corresponding number of symbols (see the sketch after this list).
    3. Finally, the result is sliced to the desired length. This relies on the assumption that any substring of the base-62-encoded SHA-256 digest carries the same entropy; however, I could not find theoretical results showing that this is optimal.
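
    If the full 66-symbol alphabet from the question (alphanumerics plus !?-=) is wanted, the same approach works with a different alphabet string passed to base-x. The following is a minimal sketch under that assumption; shortHash66 is a hypothetical name, not part of the original code:

    import crypto from "crypto";
    import basex from "base-x";

    // Hypothetical variant using the 66-symbol alphabet [a-zA-Z0-9!?-=].
    // The symbol order is arbitrary, but it must stay fixed for hashes to be stable.
    const base66 = basex(
      "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!?-="
    );

    function shortHash66(input: string, precision = 15): string {
      return base66
        .encode(crypto.createHash("sha256").update(input).digest())
        .slice(0, precision);
    }

    // Usage: same call shape as shortHash above.
    const id = shortHash66("foo", 10); // 10-character hash drawn from the 66-symbol alphabet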