mysqlduplicatescompound-key

Remove rows with duplicate primary compound key


I have a table consisting of the following integer columns:

group_id, person_id, sequence

Ideally, the primary key would be (group_id, person_id), but there are some duplicates I need to remove first. Whenever there are duplicates, I want to keep only the lowest sequence value.

I figured out a query to select some of the errant rows, but it only gets the highest sequence number in the duplicate set.

SELECT COUNT(*) AS num, group_id, person_id, MAX(sequence)
FROM my_table
GROUP BY group_id, person_id
HAVING COUNT(*) > 1;

I'm sure I'm missing something simple. Is there an easy way to remove these duplicates?

Thanks.


Solution

  • Try writing a query that returns the rows you want to delete. Assuming that the combination of (group_id,person_id,sequence) is UNIQUE, and you don't have NULL values...

     SELECT t.* 
       FROM my_table t
       JOIN ( SELECT o.group_id
                   , o.person_id
                   , MAX(o.sequence) AS max_sequence
                FROM my_table o
               GROUP BY o.group_id, o.person_id
              HAVING COUNT(*) > 1
            ) d
        ON d.group_id      = t.group_id
       AND d.person_id     = t.person_id
       AND d.max_sequence  = t.sequence
    

    We can convert that into a DELETE statement by replacing the SELECT keyword with the DELETE keyword.

    Or, when I'm deleting rows with statements similar to this, I will usually create a table as a "backup" of the rows I'm going to delete.

    Just precede the SELECT with CREATE TABLE some_new_table_name AS.

    Then, we can reference the "saved" rows in a DELETE query

    DELETE t.*
      FROM my_table t
      JOIN some_new_table_name d
        ON d.group_id      = t.group_id
       AND d.person_id     = t.person_id
       AND d.max_sequence  = t.sequence
    

    This approach only gets "one" of the duplicates. If the original query had count values larger than 2, then we'd need to repeat this enough times, each time deleting the highest sequence value, repeating that until there are no count values greater than 1.

    If there are a lot of duplicates to remove, we could use a slightly different pattern to get them in one fell swoop.

    Instead of returning MAX(sequence) (the row we want to delete), we can instead return MIN(sequence), the row we want to keep. And we'd change the predicate,

        AND d.max_sequence  = t.sequence
    

    to be

        AND d.min_sequence  <> t.sequence
    

    So that we delete all rows for that group_id, person_id EXCEPT for the one with the minimum value.

    I strongly recommend you write this as a SELECT first, before converting it into a DELETE statement. And I also recommend you have a good backup of the table and/or "save" copies of the rows you are going to delete. Just in case you need to restore some rows.