mysqlprimary-keycrc32primary-key-designshared-primary-key

crc32 of natural key as a primary key


I have a database of TCG cards and I am trying to decide a primary key. I settled with a surrogate key initially but I realized that sometimes, there are cards that I forget to add, like promo cards for example. This is problematic with the surrogate key because they get added to the database with the latest auto increment and I didn't want their IDs to depend on the order they were inserted. I was thinking maybe I could do a hash on some of the card features and use that as the primary key instead?

Take for example the following pseudo-code:

// set code, date released, collector number, name
$crc = crc32(implode(',', ['A', '1993-08-03', '232a', 'black lotus']));
echo $crc; // 4199975187

The possible amount of cards hovers just around 25k now and grows around 100-300 every 6 months.

  1. At this rate, there won't be a collision right?
  2. Is this good practice? Do I have any other good alternatives?

I know I can make the hash shorter by converting it to base 62 but I will be joining these to the users' inventories table so I think maintaining these in int will be the best option.


Solution

  • I take exception to this:

    This is problematic with the surrogate key because they get added to the database with the latest auto increment and I didn't want their IDs to depend on the order they were inserted.

    ID (correctly: "Id", as it's an abbreviation, not an initialism) is short for "Identity", which has the single property of being unique for each element, that is, it is used to identify each element. You should attach no other connotation to that, so the fact it monotonically increases with insertion order is irrelevant and sorting data by a generated identity column is meaningless (unless it's inside an index used for lookup-by-Id). You should consider Ids as opaque handles in that case.

    Certainly if you use a digest (such as CRC-32) the sort-order of Ids is meaningless, but actually presents less utility than having a monotonically increasing Id.

    You correctly identified the risk of a hash collision, the range space of CRC-32 is only 32 bits, and if you have 25,000 cards then the Birthday Problem tells us the odds of a hash-collision in such a small range space is not insignificant.

    Just use the auto-generated Id value. :)

    Using a computed hash as a identifier / key does have utility - this is how hash-tables work, in that it allows you to quickly find something by value, without actually searching all of the table (e.g. to find the Black Lotus card, take the hash of its properties as you do, then lookup the computed hash in the ID column, without having to do SELECT ... WHERE code = 'A' AND ... AND name = 'black lotus', but it does require you to know every property value first, and if you set up the right table indexes this quickly becomes moot.

    The main problem with using a hash as a primary key is that:

    1. Primary Keys should be immutable ("never-changing")
    2. The key now depends on the data
    3. If the data changes, (e.g. "blcak lotus" becomes "Black Lotus") then the key is invalid and must be recreated, but you can't do that because keys are immutable... rendering the previously-computed Id invalid.

    QED :)