Tags: python, algorithm, hash, cryptography, url-shortener

String conversion/shortening to a fixed length, similar to a URL shortener


I need to shorten a unique string ID to a maximum of 12 characters. The ID could be longer or shorter than 12 characters before the conversion, but it must be at most 12 characters long afterwards. The result could also be represented as an int or even a float.

Using this function on the same string should always return the same shortened ID. However, it should never return the same value for two different IDs.

(I know, theoretically, this is not possible with a fixed number of output chars, but if it's reasonably unlikely to produce the same result twice, that's okay, because I am only dealing with a few thousand IDs.)

I was thinking of a hash function, but you can't really specify the length of the return value. Reversibility of the function, as with a URL shortener, would be a benefit, but I can also create a dictionary for that purpose.

Any hints to an algorithm that works in this scenario are appreciated!


Solution

  • Let's do some maths. With 12 case-insensitive alphanumeric characters in the output, you have 36 different output characters (26 letters + 10 digits), and 36^12 possible different outputs. If the hash function is good, the entropy in that will be log2(36^12) = 12 * log2(36), or about 62 bits.

    According to the birthday paradox, though, once the number of hashes reaches the square root of that many possibilities, the chance of a collision is already about 40%, and it passes 50% only slightly later. In other words, among 2^31 hashes there will quite likely be a collision. 2^31 is not that much, a little more than 2 billion.

    With n hashes and a perfect cryptographic hash function, the chance p of at least one collision is approximately n^2 / (2 * 36^12):

    n=1000: p=10^-13
    n=10000: p=10^-11
    n=100000: p=10^-9
    n=1000000: p=10^-7
    ...
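
The figures above can be reproduced with the usual birthday approximation p ≈ 1 - exp(-n(n-1) / (2N)), where N is the number of possible outputs. This is only a sketch that assumes an ideal, uniformly distributed hash; the function name `collision_probability` is made up for illustration.

```python
import math


def collision_probability(n: int, space: int) -> float:
    """Approximate chance that n uniformly random values drawn from
    `space` possible outputs contain at least one collision."""
    # expm1 keeps precision when the probability is tiny (e.g. 1e-13),
    # where 1 - exp(-x) would lose most of its significant digits.
    return -math.expm1(-n * (n - 1) / (2.0 * space))


if __name__ == "__main__":
    space = 36 ** 12  # 12 case-insensitive alphanumeric characters
    for n in (1_000, 10_000, 100_000, 1_000_000):
        print(f"n={n}: p={collision_probability(n, space):.2e}")
```

Passing `2 ** 48` as `space` gives the corresponding figures for 12 hex characters.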
    

    If you take the first several characters of a known good hash like SHA-2, you will mostly be fine. However, note that hex-encoded SHA-2 output has a lot less entropy per character, only 4 bits, so 12 characters of the hex representation of a hash output will only have (at most) 48 bits of entropy. 1000 such values have a collision chance of about 1.8 * 10^-9, 10000 about 1.8 * 10^-7, 100000 about 1.8 * 10^-5, and 1 million are already at roughly 0.2%, and so on.
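
As a rough illustration of both variants, the sketch below derives a 12-character ID from SHA-256 in two ways: by truncating the hex digest (about 48 bits) and by re-encoding the digest in base 36 (about 62 bits). The names `short_id_hex`, `short_id_base36` and `ALPHABET` are assumptions made for this example, not part of any library; only `hashlib.sha256` comes from the standard library.

```python
import hashlib

# 36 case-insensitive alphanumeric characters (digits + lowercase letters).
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


def short_id_hex(long_id: str, length: int = 12) -> str:
    """First `length` hex characters of SHA-256: 4 bits per character."""
    return hashlib.sha256(long_id.encode("utf-8")).hexdigest()[:length]


def short_id_base36(long_id: str, length: int = 12) -> str:
    """`length` base-36 characters derived from SHA-256: ~5.17 bits each."""
    digest = hashlib.sha256(long_id.encode("utf-8")).digest()
    # Reduce the 256-bit digest into the range [0, 36**length).
    value = int.from_bytes(digest, "big") % (36 ** length)
    chars = []
    for _ in range(length):
        value, remainder = divmod(value, 36)
        chars.append(ALPHABET[remainder])
    # Digits were produced least-significant first; the result is
    # always exactly `length` characters, zero-padded on the left.
    return "".join(reversed(chars))


if __name__ == "__main__":
    example = "some-very-long-unique-identifier-0001"
    print(short_id_hex(example))
    print(short_id_base36(example))
```

Both functions are deterministic, so the same input always yields the same shortened ID. Neither is reversible on its own.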

    Only you can tell whether that's good enough.
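
Because you only have a few thousand IDs and mention keeping a dictionary anyway, one option is to let that dictionary double as a collision check. The following is a minimal sketch under that assumption; `short_id` and `build_lookup` are hypothetical helper names, and the truncated SHA-256 hex digest stands in for whichever shortening you choose.

```python
import hashlib


def short_id(long_id: str, length: int = 12) -> str:
    """Deterministic short ID: truncated SHA-256 hex digest."""
    return hashlib.sha256(long_id.encode("utf-8")).hexdigest()[:length]


def build_lookup(ids, length: int = 12) -> dict:
    """Map each short ID back to its original ID.

    Raises ValueError if two different originals share a short ID,
    so a collision is detected instead of silently overwriting."""
    lookup = {}
    for original in ids:
        key = short_id(original, length)
        # setdefault stores `original` only if the key is new and
        # returns whatever is stored under the key afterwards.
        existing = lookup.setdefault(key, original)
        if existing != original:
            raise ValueError(f"collision: {existing!r} and {original!r} -> {key}")
    return lookup


if __name__ == "__main__":
    lookup = build_lookup(["customer-000123", "customer-000124"])
    for key, original in lookup.items():
        print(key, "->", original)
```

The lookup provides the reverse direction (short ID to original), and the exception tells you if the unlikely case ever happens with your actual data.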