databasemathhashprobabilityhash-collision

Probability of New Hash Collision, Conditional on No Current Collisions


I'm trying to understand the probability of collision of new hashes, given no collisions in the existing hash table yet.

For illustration, let's say I have a table where I store hashes of each row.

  1. The table currently has 1 billion rows
  2. There are no hash collisions amongst those 1 billion rows.
  3. I'm using a 64-bit hash algorithm.

Now imagine I insert 10 million new rows of data into the table. What is the probability that I have a hash collision now? I think the answer is the following:

Each new row's hash cannot have the same value of any of the existing rows or the new ones processed before itself. That removes 1 billion hash values from the 2^64 possibilities, so the probability of new collisions should be:

Does that sound right?


Solution

  • Thanks to President James K. Polk, I realized that my original solution was wrong. The probability of no collisions is

    Another way to think of it is just using the definition of conditional probability.

    ...which reduces to...

    ...which can be reduced to the product formula.

    The benefit of the conditional probability formula is that it can be easily estimated using any of the online hash collision probability calculators.