I've encountered some code that generates a number of UUIDs via UUID.randomUUID()
, takes the last 7 digits of each (recent versions of UUID are uniformly distributed in terms of entropy), and uses that as a key to insert rows into a database.
I wondered what the probability of collision was. I remembered the Birthday Problem. This is an instance of that problem, is it not? Instead of 365 days in a year, there are 16^7 possible strings. Then according to that Wikipedia page, the probability of collision given n strings is
where d is 16^7.
A colleague asserted that they were able to insert 50,000 rows without an issue. I doubted this because the probably of collision with n=50,000 and d=16^7 is
1-((16^7-1)/16^7)^(50000*(50000-1)/2) = 99.05%
But then I checked our database. Indeed the 50,000 inserts succeeded.
How was that possible? (Besides that they got really lucky.) Am I misapplying the Birthday Problem?
Edit
Here's a minimal test that shows there should have been collisions.
import java.util.UUID;
import java.util.HashSet;
public class MyClass {
public static void main(String args[]) {
final int N = 50000;
for (int trial = 0; trial < 10; trial++) {
HashSet<String> strings = new HashSet<>();
Boolean success = true;
for (int i = 0; i < N; i++) {
String uuid = UUID.randomUUID().toString();
String id = uuid.substring(uuid.length() - 7);
if (strings.contains(id)) {
success = false;
// System.out.println(id);
break;
}
strings.add(id);
}
System.out.println(success);
}
}
}
For N=50,000, 10 out of 10 attempts had a collision—which would be expected of a 99% collision rate. For N=10,000, I see 0, 1, 2, or 3 out of 10 attempts with collisions, which all fall within expectations for a 17% collision rate.
Finally I have an explanation. Thank you everyone for helpful ideas.
tl;dr - I thought there had had to have been a unique constraint at the time of insert, by the fact that indeed there were 50,000 distinct codes in the database. But it turned out there was not a constraint at that time. There indeed were duplicates among the 50,000 originally, that a month later were found and modified via a one-time SQL statement (modified by appending a "1").
Full explanation:
This code was about generating one-time use promo codes such as SUMMER23-65ACFF9. This is how I reasoned that indeed 50,000 distinct codes had been inserted into the database:
This table didn't have a timestamp field (e.g. last_modified) or I might have had a hint sooner. I only knew that the batch of 50,000 promo codes were inserted on January 6th, 2023—3 months ago.
I browsed the repo at the last commit before January 6th, 2023 to see if something about the code back then might have allowed 50,000 codes to succeed. The relevant code was this:
I believed that, because of the calls to .rollback(), the insert of 50,000 rows was being done atomically. In other words, if a single insert failed, all the inserts up until then should have been rolled back.
(So one possibility was that my colleague(s) just kept retrying and retrying until they hit the 1% jackpot. But when I asked them, that didn't seem to be the case. They did not recall having to retry at all.)
I did wonder if the constraint for unique promo_code_id existed at the time of insertion. I checked that it did exist:
I was hoping to find a timestamp for when this constraint was created but didn't immediately see one. I should have pursued that a little further, but I didn't because: I had already queried for a count of distinct promo_code_ids and gotten 50,000 (see first screenshot above). Constraint or no constraint, the probability of ending up with 50,000 distinct codes was still less than 1%. This is where I made a faulty assumption: that the codes hadn't been tampered with since.
Today I came across a Liquibase XML change in February (a month after the "successful" 50,000 inserts) where we apparently added the constraint:
But notice the SQL that comes with that change, besides adding the unique constraint. We essentially ran a script to add a "1" to the end of all duplicate codes. So that is how the "insert of 50,000 codes without duplicates" succeeded: It didn't—it was without constraint in the first place, and corrected afterward.