sqlduplicatesclickhouse

How can I find duplicate records in a clickhouse db table


How can I find duplicate data entries in one specific table in clickhouse DB?

I am actually investigating on a merge tree table and actually threw optimize statements at my table but that didn't do the trick. The duplicate entries still persist.

Preferred would be to have a universal strategy without referencing individual column names.

I only want to see the duplicate entries, since I am working on very large tables.


Solution

  • The straight forward way would be to run this query.

    SELECT 
        *, 
        count() AS cnt
    FROM myDB.myTable
    GROUP BY *
    HAVING cnt > 1
    ORDER BY date ASC
    

    If that query gets to big you can run it in pieces.

    SELECT 
        *,
        count() AS cnt
    FROM myDB.myTable
    WHERE (date >= '2020-08-01') AND (date < '2020-09-01')
    GROUP BY *
    HAVING cnt > 1
    ORDER BY date ASC