pythonalgorithmdictionaryhashmaphash-collision

Dictionary/Hashmap implementation using double hashing is stuck in an infinite loop


I'm following this formula from wikipedia:

H(i, k) = (H1(k) + i*H2(k)) % size

and my H1 is Python's built-in hash() function.

H2 is:

PRIME - (H1(k) % PRIME)

Unfortunately it randomly sticks in an infinite loop after a couple of execution. It cannot traverse all the slots in my table.

Here is my code but you have to set PYTHONHASHSEED=12 in order to reproduce this bug. (I deliberately removed many details so that the implementation would be minimal)

EMPTY = object()

class DoubleHashingHashMap:
    def __init__(self):
        self.prime = 7
        self.size = 15
        self.slots = [EMPTY] * self.size

    def __setitem__(self, key, value):
        for idx in self.probing_squence(key):
            slot = self.slots[idx]
            if slot is EMPTY:
                self.slots[idx] = (key, value)
                break
            elif isinstance(slot, tuple):
                k, v = slot
                if k == key:
                    self.slots[idx] = (key, value)
                    break

    def probing_squence(self, key):
        h1 = self.hash_func1(key) % self.size
        h2 = self.hash_func2(key) % self.size
        i = 1
        while True:
            yield (h1 + i*h2) % self.size
            i += 1

    def hash_func1(self, item):
        return hash(item)

    def hash_func2(self, item):
        return self.prime - (self.hash_func1(item) % self.prime)

hashmap = DoubleHashingHashMap()
for i in range(8):
    hashmap[str(i)] = i
print("8 items added.")
print("Going into the infinite loop when adding 9th item(which is 8)...")
hashmap["8"] = 8
print("This line can't be reached.")

I would appreciate if you tell me what's wrong with my math.


Solution

  • The logic calculating the sequence is flawed. For the configuration you mentioned it just will output 0, 5, 10 forever since the 0, 5, 10 slots are already occupied this will go on forever. You only multiply h2 with i and do the modulo with the size. This will loop quite often through a few specific values and won't cover all possible indexes.

    This is what happens in your case

    # h1 = 10, h2 = 5, calculating the first 10 outputs you would get
    print((10 + np.arange(10) * 5) % 15)
    array([10,  0,  5, 10,  0,  5, 10,  0,  5, 10])
    

    So this actually loops through only 3 values, quite bad with 15 possible ones. Probably the reason why this bug happens so fast.

    With how your implement it you can just increase the index by one and do this until a slot is empty and in the __getitem__ you need to check if the key requested matches the key in the slot and if not do the same logic by increasing it by one until you find it.

    EMPTY = object()
    
    
    class DoubleHashingHashMap:
        def __init__(self):
            self.prime = 7
            self.size = 15
            self.slots = [EMPTY] * self.size
    
        def __setitem__(self, key, value):
            for idx in self.probing_squence(key):
                slot = self.slots[idx]
                if slot is EMPTY:
                    self.slots[idx] = (key, value)
                    break
                elif isinstance(slot, tuple):
                    k, v = slot
                    if k == key:
                        self.slots[idx] = (key, value)
                        break
    
        def __getitem__(self, key):
            for idx in self.probing_squence(key):
                slot = self.slots[idx]
                if slot is not EMPTY and slot[0] == key:
                    return slot[1]
    
        def probing_squence(self, key):
            h1 = self.hash_func1(key) % self.size
            h2 = self.hash_func2(key) % self.size
            i = 0
            while True:
                yield (h1 + h2 + i) % self.size
                i += 1
    
        def hash_func1(self, item):
            return hash(item)
    
        def hash_func2(self, item):
            return self.prime - (self.hash_func1(item) % self.prime)
    
    
    hashmap = DoubleHashingHashMap()
    for i in range(8):
        hashmap[str(i)] = i
    print("8 items added.")
    print("Going into the infinite loop when adding 9th item(which is 8)...")
    hashmap["8"] = 8
    print("This line can't be reached.")
    print(hashmap["1"], hashmap["8"])
    

    So this fixes it but probably not in the way you want since you reference the wikipedia.

    So why does the formula from wikpedia not work in your case. This is probably because your h2 does not have all needed characteristics.

    The Wikipedia you linked says

    The secondary hash function h2(k) should have several characteristics:

    • it should never yield an index of zero

    • it should be pair-wise independent of h1k

    • it should cycle through the whole table

    • All h2(k) be relatively prime to the size

    Your h2 actually has only the first characteristics. It can't be 0. It is definitely dependent on h1 since you use h1 to calculate h2. It won't cycle through the whole table since your self.prime < self.size. It can definitely output e.g. 5, which is not relative prime to a total size of 15. They both share the factor of 5.

    As said in the article to e.g. have the relative prime characteristic you can have the total size be a power of 2 and only ever return odd numbers from h2. This will automatically make it relatively prime. You should not use h1 to calculate h2 to make them independent and make sure the outputs of h2 are in the interval [1, size - 1].

    So if you want to apply the hashing rule you need to make sure your h2 actually has the characteristics needed. Otherwise the closed loop of a few numbers will happens as you observed.