algorithm, hashmap, hash-code-uniqueness

Simple hashcode in hashmap misconception?


I am implementing my own specialized hashmap which has generic value types, but whose keys are always of type long. Here and there, I see people suggesting that I should multiply the key by a prime and then take the result modulo the number of buckets:

int bucket = (key * prime) % numOfBuckets;

and I don't understand why. It seems to me that it has exactly the same distribution as the simple:

int bucket = key % numOfBuckets;

For example, if numOfBuckets is 8, with the second "algorithm" we get buckets like {0, 1, 2, 3, 4, 5, 6, 7} repeating for key = 0 to infinity. With the first algorithm, for the same keys, we get buckets like {0, 3, 6, 1, 4, 7, 2, 5} (or similar), also repeating. Basically, we have the same problem as when using the identity hash.
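
A quick check seems to confirm this (just a sketch, assuming numOfBuckets = 8 and prime = 31; the multiplied version only visits the buckets in a different order, with the same period):

// Sketch: print both bucket sequences for keys 0..15 with 8 buckets and prime = 31.
public class BucketDemo {
    public static void main(String[] args) {
        long numOfBuckets = 8;
        long prime = 31;
        for (long key = 0; key < 16; key++) {
            long plain = key % numOfBuckets;                // 0, 1, 2, ..., 7, repeating
            long multiplied = (key * prime) % numOfBuckets; // 0, 7, 6, ..., 1, repeating
            System.out.println(key + ": " + plain + " vs " + multiplied);
        }
    }
}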

In both cases we get collisions for the keys:

key = x + k*numOfBuckets (for k = 1 to infinity; and x = key % numOfBuckets)

because when we take the modulo by numOfBuckets we always get x. So what's the deal with the first algorithm? Can someone enlighten me?


Solution

  • If numOfBuckets is a power of two and the prime is odd (which seems to be the intended use case), then we have gcd(numOfBuckets, prime) == 1. That in turn means there is a number inverse such that inverse * prime = 1 (mod numOfBuckets), so the multiplication is a bijective operation that just shuffles the buckets around a bit. That is of course useless, so your conclusions are correct.

    Or perhaps more intuitively: in a multiplication, information only flows from the lowest bit towards the highest, never in reverse. So any of the bits that the bucket index would not rely on without the multiplication are still discarded with the multiplication.

    Some other techniques do help, for example Java's HashMap uses this:

    /**
     * Applies a supplemental hash function to a given hashCode, which
     * defends against poor quality hash functions.  This is critical
     * because HashMap uses power-of-two length hash tables, that
     * otherwise encounter collisions for hashCodes that do not differ
     * in lower bits. Note: Null keys always map to hash 0, thus index 0.
     */
    static int hash(int h) {
        // This function ensures that hashCodes that differ only by
        // constant multiples at each bit position have a bounded
        // number of collisions (approximately 8 at default load factor).
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }
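
    To see what that buys you, here is a minimal, hypothetical sketch (not the real HashMap source) of how the spread hash is then turned into a bucket index. HashMap masks with table.length - 1, which is equivalent to a modulo because the table length is a power of two:

    class SpreadDemo {
        static int hash(int h) {             // the same spread function as above
            h ^= (h >>> 20) ^ (h >>> 12);
            return h ^ (h >>> 7) ^ (h >>> 4);
        }

        static int bucketFor(int hashCode, int tableLength) {
            return hash(hashCode) & (tableLength - 1);
        }

        public static void main(String[] args) {
            // 0x10000 and 0x20000 differ only above the low 4 bits, so plain
            // masking with 16 buckets puts both into bucket 0 ...
            System.out.println((0x10000 & 15) + " " + (0x20000 & 15));                 // 0 0
            // ... while the spread hash separates them.
            System.out.println(bucketFor(0x10000, 16) + " " + bucketFor(0x20000, 16)); // 1 2
        }
    }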
    

    Another thing that works is multiplying by some large constant and then using the upper bits of the result (which contain a mixture of the bits below them, so all bits of the key can be used that way).
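
    For the long keys in the question, that could look roughly like this (only a sketch; the constant is 2^64 divided by the golden ratio, a popular choice for multiplicative hashing, but any large odd constant with well-mixed bits will do, and bucketBits is assumed to be log2 of a power-of-two bucket count):

    // Multiplicative (Fibonacci-style) hashing for long keys: multiply by a
    // large odd constant and keep the top bits, which mix in all lower bits.
    static int bucket(long key, int bucketBits) {
        // assumes 1 <= bucketBits <= 32
        return (int) ((key * 0x9E3779B97F4A7C15L) >>> (64 - bucketBits));
    }

    // e.g. bucket(someKey, 3) yields an index in [0, 8) for 8 buckets.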