sqlsql-serverdeduplication

De-duplicating similar but not identical URLs with a SQL query


I have a dataset with thousands of URLs stored in a column called Website (type VARCHAR) in a table called WebsiteData. There are many pairs of URLs (stored in separate rows) that are identical except that one begins with www, e.g. www.google.com and the other does not, e.g., google.com. How would I design a SQL query that identifies these pseudo-duplicates and deletes the version that does not start with www?


Solution

  • I derived 2 tables one with urls that have www. and one without. Join them together by adding www. to the urls without.

    -- SELECT first to review the records.
    select *
    from
    (select * from website where url not like 'www.%') wA
    join 
    (select * from website where url like 'www.%') wB
        on 'www.' + wa.url = wb.url
    
    
    delete wA
    from
    (select * from website where url not like 'www.%') wA
    join 
    (select * from website where url like 'www.%') wB
        on 'www.' + wa.url = wb.url