I have a Google BigQuery project in which I need to delete duplicate rows from a table, where duplicates are identified by a single column. What is an efficient way to do this? I am new to Google BigQuery.
One way to solve this is with a window function. You can use ROW_NUMBER() to number the rows within each group of duplicates, and then decide, based on your business logic, which records to keep and which to drop. In the example below, one row per duplicate group is retained; note that without an ORDER BY in the window, which row survives is arbitrary, so if you need a deterministic result (e.g. keep the oldest row), add an ORDER BY on a timestamp or similar column.
-- BigQuery's DELETE statement does not accept a leading WITH clause,
-- so the window function goes inside the subquery instead.
DELETE FROM
  dataset.table_name t
WHERE
  EXISTS (
    SELECT
      1
    FROM (
      SELECT
        primary_key,
        -- Number each row within its group of duplicates. Without an
        -- ORDER BY, which row gets row number 1 (and is kept) is arbitrary.
        ROW_NUMBER() OVER (PARTITION BY duplicate_column_name) AS rn
      FROM
        dataset.table_name
    ) dr
    WHERE
      t.primary_key = dr.primary_key
      AND dr.rn > 1
  );
Try this out and hopefully you will be able to eliminate the duplicate records. If the problem persists, please feel free to share the error log.
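If rewriting the whole table is acceptable, a simpler alternative is to keep one row per group with BigQuery's QUALIFY clause and replace the table with the deduplicated result. This is a sketch using the same placeholder names (dataset.table_name, duplicate_column_name) as above:

```sql
-- Keeps one arbitrary row per value of duplicate_column_name and
-- overwrites the table with the deduplicated result.
CREATE OR REPLACE TABLE dataset.table_name AS
SELECT *
FROM dataset.table_name
QUALIFY ROW_NUMBER() OVER (PARTITION BY duplicate_column_name) = 1;
```

Since CREATE OR REPLACE rewrites the entire table, consider copying it to a backup table first so you can recover if the result is not what you expected.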