I have an issue. I have a table with almost 2 billion rows (yeah, I know...) that has a lot of duplicate data in it, which I'd like to delete. How exactly should I go about that?
The columns are: first, last, dob, address, city, state, zip, telephone, and they are in a table called PF_main. Thankfully, each record does have a unique ID, which is in a column called ID.
How can I dedupe this and leave one unique entry (row) in the pf_main table for each person?
Thank you all in advance for your responses...
A 2 billion row table is quite big. Let me assume that first, last, and dob constitute a "person". My suggestion is to build an index on the "person" columns and then use a truncate/re-insert approach.
In practice, this looks like:
-- Index to make the duplicate lookup below fast
create index idx_pf_main_first_last_dob on pf_main(first, last, dob);

-- Copy out one row per person: keep the row with the lowest id
-- for each (first, last, dob) combination
select m.*
into temp_pf_main
from pf_main m
where not exists (select 1
                  from pf_main m2
                  where m2.first = m.first and
                        m2.last = m.last and
                        m2.dob = m.dob and
                        m2.id < m.id
                 );

-- Empty the original table and reload it from the deduplicated copy
truncate table pf_main;

insert into pf_main
    select *
    from temp_pf_main;
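
If the duplicates are only a small fraction of the table, another option is to delete them in place instead of truncating and reloading. This is just a sketch, assuming a database with window functions and updatable CTEs (such as SQL Server, which the select ... into syntax above suggests); it keeps the lowest id per (first, last, dob), same as the approach above:

-- Sketch: in-place dedupe, keeping the lowest id per (first, last, dob).
-- On a table this size you would likely want to run this in batches
-- to keep the transaction log under control.
with ranked as (
    select id,
           row_number() over (partition by first, last, dob
                              order by id) as rn
    from pf_main
)
delete from ranked
where rn > 1;

The truncate/re-insert approach rewrites the whole table, which tends to win when most rows are duplicates; the in-place delete avoids copying all the keeper rows but generates more log activity per deleted row.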