Tags: sqlite, sqlite3-python

How to efficiently remove duplicates from a large table in SQLite?


I have a table of ~200 million rows and 6 columns, and I want to keep only the rows that are unique across all columns.

I am currently running:

    CREATE TABLE temp AS SELECT DISTINCT * FROM plates;

This has been running for 10 hours and is still going, so I must be doing something wrong. I have read about indexing and about grouping by rowids, but I do not understand how either would speed up the process, since every row and every column still has to be evaluated. I am mainly looking for insight here, since I am probably missing something.
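
For reference, the rowid-based deduplication I have read about looks roughly like this (a through f stand in for my six columns):

    -- Keep the row with the smallest rowid in each duplicate group
    -- and delete every other row that shares the same six values.
    DELETE FROM plates
    WHERE rowid NOT IN (
        SELECT MIN(rowid)
        FROM plates
        GROUP BY a, b, c, d, e, f
    );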


Solution

  • You could try:

    create table temp (
        a,b,c,d,e,f,
        primary key (a,b,c,d,e,f) on conflict ignore
    );
    
    insert into temp (a,b,c,d,e,f)
        select a,b,c,d,e,f from plates;
    

    where a, b, c, d, e, f are the relevant columns.
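
    The ON CONFLICT IGNORE clause on the composite primary key makes SQLite silently skip any insert whose six values already exist in temp, so duplicates are dropped as the rows are copied. Once the insert finishes, you can swap the deduplicated table into place, for example:

    drop table plates;
    alter table temp rename to plates;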