algorithmperformancehashhashtablelinear-probing

Linear probing huge sequences of keys with unequal hash


There is one thing about linear probing (hash tables) that is not intuitive to me. If I put key1 which hash results to array index 1. Then I put key2 -> array index 2. Then I put key3 -> again array index 1, this will go to array index 3. Then when I search for key3 I should go through indexes that contain keys that do not have the same hash as mine at all. Isn't this waste? If the sequence is really big and contains many keys (for example I have 20 elements, then null, for any key that results in array index from 0 to 20 I must go through all the indexes although they do not have the same hash as mine and I can eliminate this with separate chaining).

Or this is mitigated by the fact that our hashing function (if written well enough) distributes the keys equally among the indices and we constantly resize the array to be max half full?


Solution

  • Linear probing is sub-optimal when there are many collisions. Note that the number of collisions doesn't only depend on the hash, but also on the number of slots in the table (usually a prime number) because the index is the remainder of the integer division of the hash by the table length.

    Note however, than having the colliding keys one next to the other might also take advantage of the CPU caches, which will bring from RAM many elements in one read. So, do not (in principle) think that the time it takes to check 20 probes is 20 times the time it takes to check one, because what happens inside the CPU and its caches is much faster than going to RAM. There is no magic though. If the computation of every comparison throws away what's in the caches, then part of the savings will be lost.