postgresql, primary-key, composite-primary-key

What is the typical method for preventing duplicates in a PostgreSQL table?


Consider the following table, recipe, in a PostgreSQL database.

Column      | Type    | Collation | Nullable | Default
------------+---------+-----------+----------+---------
name        | text    |           | not null |
date        | date    |           | not null |
ingredients | text[]  |           | not null |
calories    | integer |           | not null |
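
A plain-SQL sketch of that table definition (reconstructed from the column list above; so far it has no key or uniqueness constraint of any kind):

    CREATE TABLE recipe (
        name        text    NOT NULL,
        date        date    NOT NULL,
        ingredients text[]  NOT NULL,
        calories    integer NOT NULL
    );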

Assume a user wants to add a new recipe to the table. The user does not want duplicates in the table.

What is the best practice for inserting a new row into this table without duplicates?

Solution 1

Introduce an identification scheme before adding any recipes. We can say that any two recipes with the same name and ingredients are duplicates. The goal is to prevent duplicates from arising, so a unique ID based on the recipe name and ingredients is generated (somehow), and an id column is added to the table.

Column      | Type    | Collation | Nullable | Default
------------+---------+-----------+----------+---------
id          | text    |           | not null |
name        | text    |           | not null |
date        | date    |           | not null |
ingredients | text[]  |           | not null |
calories    | integer |           | not null |

We can then use this ID as the primary key of the table.

One (bad) possible way of generating this unique ID is to concatenate the name and ingredients of any new recipe, hash the result, and use the resulting 256-bit string as the ID.

If I find a recipe online, insert the "new" recipe into my table, and find that the ID already exists, I know this is not actually a new recipe and I can safely abort the insertion.
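
A minimal sketch of Solution 1, assuming PostgreSQL 11 or later for the built-in sha256() function and starting from the keyless table above; the recipe values are made up for illustration:

    -- The table is still empty, so the NOT NULL column and the primary key
    -- can be added directly.
    ALTER TABLE recipe ADD COLUMN id text NOT NULL;
    ALTER TABLE recipe ADD PRIMARY KEY (id);

    -- Derive the id by hashing name + ingredients. Inserting a duplicate id
    -- fails with a unique_violation error, which the application can catch
    -- and treat as "this recipe already exists".
    INSERT INTO recipe (id, name, date, ingredients, calories)
    VALUES (
        encode(sha256(convert_to('Pancakes' || array_to_string(ARRAY['flour', 'egg', 'milk'], ','), 'UTF8')), 'hex'),
        'Pancakes',
        CURRENT_DATE,
        ARRAY['flour', 'egg', 'milk'],
        350
    );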

Solution 2

Create a composite primary key made up of the name and ingredients columns.
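
A sketch of that idea, again starting from the keyless table above (PostgreSQL allows an array column such as ingredients inside a primary key):

    -- Re-inserting a row with the same (name, ingredients) pair is rejected
    -- with a unique_violation error.
    ALTER TABLE recipe ADD PRIMARY KEY (name, ingredients);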

The following example is pretty bad but bear with me.

If I run some search online for recipes and add them to my table, old recipes that have since been updated with more accurate calorie counts should be updated as well.

Solution 3

Upon every insert to the table, check the new recipe against the name and ingredients of all existing recipes. If there's a collision, update the calories column of the recipe in the table.
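
Rather than hand-checking every existing row, this check-then-update logic can be handed to the database as an upsert; a sketch, assuming a unique or primary key on (name, ingredients) as in Solution 2, with made-up values:

    -- Insert the recipe, or update its calories if (name, ingredients) already exists.
    INSERT INTO recipe (name, date, ingredients, calories)
    VALUES ('Pancakes', CURRENT_DATE, ARRAY['flour', 'egg', 'milk'], 350)
    ON CONFLICT (name, ingredients)
    DO UPDATE SET calories = EXCLUDED.calories;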

If my recipe example is scaled up to thousands or millions of rows, solutions 1 and 3 break.

Risks and Constraints

Solution 1 breaks because hash collisions become a real possibility at large scale.

Solution 3 "breaks" because comparing every single new entry against every single row in a very large table would be horribly inefficient.


Solution

  • Which fields make up a "unique identity" is a business requirement, not a technical one. For example, it would be reasonable and customary to require that the name of a recipe be unique, and leave it at that: no need for more complication, and no sense in saying two recipes are distinct just because I wrote 815 calories and you wrote 810, the rest being the same. To add a variant or slightly different version of a recipe that is already in the table, just give it a new, unique name; that is enough to distinguish what is what when a user browses the list of recipe names in the catalogue. Or, for a Customers entity in an accounting system, one might want a VAT number, if those are nationally unique (they are not in all countries, and then some added element is needed); while for an e-commerce site an email address is usually enough, since that suffices for individual legal responsibility should the case arise, and a person with multiple accounts (under different emails) is not in itself a problem. And so on; these are just examples of the line of reasoning.

    OTOH, the technical problem of how to implement any such uniqueness constraint really is just that: either make a primary key out of those fields, or (more commonly) give the table an ID or a GUID as its primary key and additionally put a uniqueness constraint on the field or group of fields that must be unique (see the sketch at the end of this answer).

    Finally, on the performance concerns: if you use the DBMS appropriately, there are no such problems; the DBMS is supposed to implement exactly this functionality in an optimized way. In particular, a uniqueness constraint is backed by an underlying unique index (likewise for a primary key), so enforcing these constraints does not involve a full scan of the table's rows (with possibly some caveats that depend on the specific DBMS).
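
    A minimal sketch of that more common pattern, assuming the business rule is simply "recipe names must be unique" and PostgreSQL 10 or later for the identity column (the constraint name is just illustrative):

        CREATE TABLE recipe (
            id          bigint  GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
            name        text    NOT NULL,
            date        date    NOT NULL,
            ingredients text[]  NOT NULL,
            calories    integer NOT NULL,
            CONSTRAINT recipe_name_uq UNIQUE (name)
        );

    Both the primary key and the UNIQUE constraint are backed by unique indexes, so checking a new row for duplicates is an index lookup, not a scan of the whole table.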