I have a table consisting of the following integer columns:
group_id, person_id, sequence
Ideally, the primary key would be (group_id, person_id)
, but there are some duplicates I need to remove first. Whenever there are duplicates, I want to keep only the lowest sequence value.
I figured out a query to select some of the errant rows, but it only gets the highest sequence number in the duplicate set.
SELECT COUNT(*) AS num, group_id, person_id, MAX(sequence)
FROM my_table
GROUP BY group_id, person_id
HAVING COUNT(*) > 1;
I'm sure I'm missing something simple. Is there an easy way to remove these duplicates?
Thanks.
Try writing a query that returns the rows you want to delete. Assuming that the combination of (group_id,person_id,sequence)
is UNIQUE, and you don't have NULL values...
SELECT t.*
FROM my_table t
JOIN ( SELECT o.group_id
, o.person_id
, MAX(o.sequence) AS max_sequence
FROM my_table o
GROUP BY o.group_id, o.person_id
HAVING COUNT(*) > 1
) d
ON d.group_id = t.group_id
AND d.person_id = t.person_id
AND d.max_sequence = t.sequence
We can convert that into a DELETE
statement by replacing the SELECT
keyword with the DELETE
keyword.
Or, when I'm deleting rows with statements similar to this, I will usually create a table as a "backup" of the rows I'm going to delete.
Just precede the SELECT with CREATE TABLE some_new_table_name AS
.
Then, we can reference the "saved" rows in a DELETE query
DELETE t.*
FROM my_table t
JOIN some_new_table_name d
ON d.group_id = t.group_id
AND d.person_id = t.person_id
AND d.max_sequence = t.sequence
This approach only gets "one" of the duplicates. If the original query had count values larger than 2, then we'd need to repeat this enough times, each time deleting the highest sequence value, repeating that until there are no count values greater than 1.
If there are a lot of duplicates to remove, we could use a slightly different pattern to get them in one fell swoop.
Instead of returning MAX(sequence)
(the row we want to delete), we can instead return MIN(sequence)
, the row we want to keep. And we'd change the predicate,
AND d.max_sequence = t.sequence
to be
AND d.min_sequence <> t.sequence
So that we delete all rows for that group_id, person_id
EXCEPT for the one with the minimum value.
I strongly recommend you write this as a SELECT
first, before converting it into a DELETE
statement. And I also recommend you have a good backup of the table and/or "save" copies of the rows you are going to delete. Just in case you need to restore some rows.