google-bigquery

How to remove duplicate data from a large data set in a Google BigQuery table?


I have a Google BigQuery project in which I need to delete duplicates from a column. What is an efficient way to do this? I am new to Google BigQuery.


Solution

  • One way to solve this problem is to use a window function. With it we can identify the duplicates that occur; then, based on your business logic, you can decide which records to keep and which ones to drop. I assume here that the first occurrence will be retained and that the table has a column (primary_key below) that uniquely identifies each row.

    -- BigQuery does not allow a WITH clause before DELETE,
    -- so the window function goes inside the WHERE subquery.
    -- ROW_NUMBER() without ORDER BY numbers the rows of each
    -- partition in an arbitrary but valid order, so one row
    -- per duplicate_column_name value survives.
    DELETE FROM
        dataset.table_name
    WHERE
        primary_key IN (
            SELECT
                primary_key
            FROM (
                SELECT
                    primary_key,
                    ROW_NUMBER() OVER (PARTITION BY duplicate_column_name) AS row_number
                FROM
                    dataset.table_name
            )
            WHERE
                row_number > 1
        );
    

    Try this out and hopefully you will be able to eliminate the duplicate records. If the problem persists, please feel free to share the error log.
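
    If your table has no unique key column, an alternative sketch is to rewrite the table keeping only the first row of each duplicate group. The names dataset.table_name and duplicate_column_name are the same placeholders as above; note that CREATE OR REPLACE TABLE overwrites the table, so test on a copy first.

    CREATE OR REPLACE TABLE dataset.table_name AS
    SELECT
        -- Drop the helper column so the schema stays unchanged.
        * EXCEPT (row_number)
    FROM (
        SELECT
            *,
            ROW_NUMBER() OVER (PARTITION BY duplicate_column_name) AS row_number
        FROM
            dataset.table_name
    )
    WHERE
        row_number = 1;

    This rewrites the table in a single scan, which on large tables is often cheaper than a DELETE, since BigQuery DML rewrites the affected storage anyway.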