I have a dataset with thousands of URLs stored in a column called Website
(type VARCHAR
) in a table called WebsiteData
. There are many pairs of URLs (stored in separate rows) that are identical except that one begins with www
, e.g. www.google.com
and the other does not, e.g., google.com
. How would I design a SQL query that identifies these pseudo-duplicates and deletes the version that does not start with www
?
I derived 2 tables one with urls that have www.
and one without. Join them together by adding www. to the urls without.
-- SELECT first to review the records.
select *
from
(select * from website where url not like 'www.%') wA
join
(select * from website where url like 'www.%') wB
on 'www.' + wa.url = wb.url
delete wA
from
(select * from website where url not like 'www.%') wA
join
(select * from website where url like 'www.%') wB
on 'www.' + wa.url = wb.url