sqlsql-serverdeduplication

Deduping SQL Server table


I have an issue. I have a table with almost 2 billion rows (yeah I know...) and has a lot of duplicate data in it which I'd like to delete from it. I was wondering how to do that exactly?

The columns are: first, last, dob, address, city, state, zip, telephone and are in a table called PF_main. Each record does have a unique ID thankfully, and its in column called ID.

How can I dedupe this and leave 1 unique entry (row) within the pf_main table for each person??

Thank you all in advance for your responses...


Solution

  • A 2 billion row table is quite big. Let me assume that first, last, and dob constitutes a "person". My suggestion is to build an index on the "person" and then do the truncate/re-insert approach.

    In practice, this looks like:

    create index idx_pf_main_first_last_dob on pf_main(first, last, dob);
    
    select m.*
    into temp_pf_main
    from pf_main m
    where not exists (select 1
                      from pf_main m2
                      where m2.first = m.first and m2.last = m.last and m2.dob = m.dob and
                            m2.id < m.id
                     );
    
    truncate table pf_main;
    
    insert into pf_main
        select *
        from temp_pf_main;