Tags: sql, postgresql, duplicates

Delete duplicate rows from table with no unique key


How do I delete duplicate rows from a Postgres 9 table when the rows are complete duplicates on every field? There is no individual field that could be used as a unique key, so I can't just GROUP BY the columns and use a NOT IN statement.

I'm looking for a single SQL statement, not a solution that requires me to create a temporary table and insert records into it. I know how to do that, but it requires more work to fit into my automated process.

Table definition:

jthinksearch=> \d releases_labels;
Unlogged table "discogs.releases_labels"
   Column   |  Type   | Modifiers
------------+---------+-----------
 label      | text    |
 release_id | integer |
 catno      | text    |
Indexes:
    "releases_labels_catno_idx" btree (catno)
    "releases_labels_name_idx" btree (label)
Foreign-key constraints:
    "foreign_did" FOREIGN KEY (release_id) REFERENCES release(id)

Sample data:

jthinksearch=> select * from releases_labels  where release_id=6155;
    label     | release_id |   catno
--------------+------------+------------
 Warp Records |       6155 | WAP 39 CDR
 Warp Records |       6155 | WAP 39 CDR

Solution

  • If you can afford to rewrite the whole table, this is probably the simplest approach:

    -- Delete every row, then re-insert one copy of each distinct row.
    WITH Deleted AS (
      DELETE FROM discogs.releases_labels
      RETURNING *
    )
    INSERT INTO discogs.releases_labels
    SELECT DISTINCT * FROM Deleted;
    

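    A quick way to confirm the rewrite worked is to look for any remaining duplicate groups; this check should return zero rows afterwards:

    -- Count groups that still have more than one copy.
    SELECT label, release_id, catno, COUNT(*)
    FROM discogs.releases_labels
    GROUP BY label, release_id, catno
    HAVING COUNT(*) > 1;
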
    If you need to specifically target the duplicated records, you can use the system column ctid, which uniquely identifies a row's physical location:

    -- Keep the first physical copy (lowest ctid) of each duplicate group
    -- and delete the rest.
    DELETE FROM discogs.releases_labels
    WHERE ctid NOT IN (
      SELECT MIN(ctid)
      FROM discogs.releases_labels
      GROUP BY label, release_id, catno
    );
    

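    To see what the subquery is comparing, you can select ctid directly; each physical row gets its own value even when every column matches, so the two sample rows for release 6155 will show distinct ctids:

    -- Inspect the physical row identifiers of the duplicated sample rows.
    SELECT ctid, label, release_id, catno
    FROM discogs.releases_labels
    WHERE release_id = 6155;
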
    Be very careful with ctid: it can change over time, for example when a row is updated or the table is rewritten by VACUUM FULL, but you can rely on it staying the same within the scope of a single statement.
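
    Once the table is clean, it may be worth stopping the duplicates from coming back. A minimal sketch, assuming the three columns together are meant to be unique (the constraint name releases_labels_uniq is hypothetical):

    -- Hypothetical constraint name. Note that NULLs never conflict under
    -- UNIQUE, so rows with a NULL label or catno could still repeat.
    ALTER TABLE discogs.releases_labels
      ADD CONSTRAINT releases_labels_uniq
      UNIQUE (label, release_id, catno);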