Tags: data-structures, hash, universal-hashing

Use of universal hashing


I'm trying to understand the usefulness of universal hashing over ordinary hashing (other than the function being chosen at random each time), reading Cormen's book.

From what I understand, in universal hashing we choose the function to be

H(x) = [(ax + b) mod p] mod m

with p a prime number larger than all the keys, m the size of the hash table, and a, b random numbers.

So for example, if I want to store the IDs of 80 people, and each ID has a value in [0, 200], then m would be 80 and p would be 211 (the next prime number). Right? I could use, let's say, the function

H(x) = [(100x + 50) mod 211] mod 80
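To make the family concrete, here is a minimal Python sketch of picking one function at random from it (the helper name `make_universal_hash` is my own; a must be nonzero, as in CLRS):

```python
import random

def make_universal_hash(p, m):
    """Pick one function H(x) = ((a*x + b) mod p) mod m from the universal
    family by choosing a and b at random.  p must be a prime larger than
    every key; m is the table size."""
    a = random.randint(1, p - 1)  # a is drawn from {1, ..., p-1}
    b = random.randint(0, p - 1)  # b is drawn from {0, ..., p-1}
    return lambda x: ((a * x + b) % p) % m

# The example from the question: keys in [0, 200], p = 211, m = 80.
h = make_universal_hash(211, 80)
print(h(42))  # some slot in the range 0..79
```

Each call to `make_universal_hash` yields a different function, which is the point: an adversary who knows the family still cannot pick keys that collide badly for the particular a and b you drew.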

But why would this help? There is a high chance that I'm going to end up with a lot of empty slots in the table, taking up space for no reason. Wouldn't it be more useful to lower m to get a smaller table, so that space isn't wasted?

Any help appreciated


Solution

  • I think the best way to answer your question is to abstract away from the particulars of the formula that you're using to compute hash codes and to think more about, generally, what the impact is of changing the size of a hash table.

    The parameter m that you're considering tuning adjusts how many slots are in your hash table. Let's imagine that you're planning on dropping n items into your hash table. The ratio n / m is called the load factor of the hash table and is typically denoted by the letter α.

    If you have a table with a high load factor (large α, small m), then you'll have less wasted space in the table. However, you'll also increase the cost of doing a lookup, since with lots of objects distributed into a small space you're likely to get a bunch of collisions that will take time to resolve.

    On the other hand, if you have a table with a low load factor (small α, large m), then you decrease the likelihood of collisions and therefore will improve the cost of performing lookups. However, if α gets too small - say, you have 1,000 slots per element actually stored - then you'll have a lot of wasted space.

    Part of the engineering aspect of crafting a good hash table is figuring out how to draw the balance between these two options. The best way to see what works and what doesn't is to pull out a profiler and measure how changes to α change your runtime.
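The trade-off above can be seen empirically with a short simulation. This sketch (the helper name `average_colliding_pairs` and the choice of prime are my own) hashes n random keys into m slots using a randomly drawn function from the family in the question, and counts colliding pairs at a high load factor versus a low one:

```python
import random

def average_colliding_pairs(n, m, trials=200):
    """Estimate the expected number of colliding key pairs when n random
    keys are hashed into m slots with a random (a, b) from the family
    H(x) = ((a*x + b) mod p) mod m."""
    p = 1_000_003  # a prime larger than any key we draw (assumption)
    total = 0
    for _ in range(trials):
        a = random.randint(1, p - 1)
        b = random.randint(0, p - 1)
        slots = [0] * m
        for key in random.sample(range(p), n):
            slots[((a * key + b) % p) % m] += 1
        # c keys in one slot form c*(c-1)/2 colliding pairs
        total += sum(c * (c - 1) // 2 for c in slots)
    return total / trials

# High load factor (alpha = 2) versus low load factor (alpha = 0.5):
print(average_colliding_pairs(160, 80))  # many colliding pairs
print(average_colliding_pairs(40, 80))   # far fewer, but half the slots sit empty
```

With a universal family the expected number of colliding pairs is about n(n-1)/(2m), so shrinking m buys space at a roughly linear cost in collisions, which is exactly the balance the answer describes.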