security, math, hash, cryptography, non-uniform-distribution

Why does taking the salted hash of the mod of a hash result in a very non-uniform distribution?


I have a million randomly generated unique IDs.

If I do:

result = int(hash(id + 'some_salt')) % 1000

Then this results in an even distribution of IDs across the integers 0 to 999, with approximately 1000 IDs mapped to each integer.
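That observation can be checked directly. A minimal sketch, using `hashlib` and the question's `'some_salt'` (with a smaller sample than the million IDs, purely for speed):

```python
import hashlib

OUTPUT_RANGE_SIZE = 1000
N = 100000  # smaller than the million IDs in the question, for speed

counts = [0] * OUTPUT_RANGE_SIZE
for uid in range(N):
    # Single salted hash of the full ID, reduced once.
    digest = hashlib.sha256((str(uid) + 'some_salt').encode()).hexdigest()
    counts[int(digest, 16) % OUTPUT_RANGE_SIZE] += 1

# Each bucket should hold roughly N / OUTPUT_RANGE_SIZE = 100 IDs.
print(min(counts), max(counts))
```

The minimum and maximum bucket counts should both be close to 100, consistent with a uniform distribution.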

If I instead first reduce the hash modulo 1000 and then take a salted hash of that intermediate value:

x = int(hash(id)) % 1000
result = int(hash(str(x) + 'some_salt')) % 1000

Then the resulting distribution is completely non-uniform. For each ID the result is of course in the range [0, 999], but some integers in this range have zero IDs mapped to them, while others have several thousand.

Why does this result in a very non-uniform distribution of values?

How can I adjust this to result in a uniform distribution of integers in the range [0,999] for my million IDs, and any given salt? I want to keep the intermediate step of reducing the potentially very large input space to some much smaller space (e.g. of size 1000).

I'm using SHA-256 hashing.

Here is some Python code which demonstrates the very non-uniform results:

import numpy as np
import hashlib

OUTPUT_RANGE_SIZE = 1000

unique_ids = range(1000000)  # sequential here, but could be any kind of unique ids
frequencies = np.zeros(OUTPUT_RANGE_SIZE, dtype='int')

for uid in unique_ids:
    # First stage: hash the ID and reduce to the smaller intermediate space.
    hash_mod = int(hashlib.sha256(str(uid).encode()).hexdigest(), 16) % OUTPUT_RANGE_SIZE
    # Second stage: salted hash of the intermediate value, reduced again.
    result = int(hashlib.sha256((str(hash_mod) + 'some_salt').encode()).hexdigest(), 16) % OUTPUT_RANGE_SIZE
    frequencies[result] += 1

print(frequencies)

Solution

  • By applying the modulo operator after your first hash, you've ensured that there are only 1000 unique outputs from that stage, regardless of how many unique IDs you had as inputs. When you hash and reduce those 1000 values again, by chance several of them land in the same bucket while other buckets receive none; the count in each final bucket is therefore roughly 1000 times the number of intermediate values that hashed to that bucket ID. You can see this by dividing the values in your frequencies array by 1000:

    [1, 0, 2, 1, 0, 0, 0, ...]
    

    If you remove the modulo operator from the first step, your output values in the second step will be evenly distributed as expected.

    Obligatory postscript: Don't invent your own cryptosystems. If this is security critical, learn about the best practices and implement them.
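The collision effect described above can be made concrete: feed each of the 1000 possible intermediate values through the second salted hash and count how the buckets fill. A minimal sketch, reusing the question's `'some_salt'` (with 1000 values thrown into 1000 buckets, roughly a fraction 1/e ≈ 37% of buckets are expected to stay empty):

```python
import hashlib
import math

BUCKETS = 1000  # size of both the intermediate space and the output range

# Push every possible intermediate value 0..999 through the second salted hash.
counts = [0] * BUCKETS
for x in range(BUCKETS):
    digest = hashlib.sha256((str(x) + 'some_salt').encode()).hexdigest()
    counts[int(digest, 16) % BUCKETS] += 1

empty = counts.count(0)
# For n items thrown into n buckets, the expected empty fraction is about 1/e.
print(empty, round(BUCKETS / math.e))
```

Every ID whose first-stage value lands on a given intermediate value inherits that value's final bucket, so each bucket count in the million-ID experiment is roughly 1000 times the corresponding `counts` entry here.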