[SOLVED] Deduping SQL Server table

Deduping SQL Server table

I have an issue. I have a table with almost 2 billion rows (yeah I know...) and has a lot of duplicate data in it which I'd like to delete from it. I was wondering how to do that exactly?

The columns are: first, last, dob, address, city, state, zip, telephone and are in a table called PF_main. Each record does have a unique ID thankfully, and its in column called ID.

How can I dedupe this and leave 1 unique entry (row) within the pf_main table for each person??

Thank you all in advance for your responses...

Solution

A 2 billion row table is quite big. Let me assume that first, last, and dob constitutes a "person". My suggestion is to build an index on the "person" and then do the truncate/re-insert approach.

In practice, this looks like:

create index idx_pf_main_first_last_dob on pf_main(first, last, dob);

select m.*
into temp_pf_main
from pf_main m
where not exists (select 1
                  from pf_main m2
                  where m2.first = m.first and m2.last = m.last and m2.dob = m.dob and
                        m2.id < m.id
                 );

truncate table pf_main;

insert into pf_main
    select *
    from temp_pf_main;