I have table with two columns, where there are duplicates:
select record_id,
session_id,
count(record_id) as total_duplicate_records
from `project.dataset.table`
group by 1,2
order by 3 desc
I saw a similar question: BigQuery Standard SQL: Delete Duplicates from Table but an answer based on some extra column to be used to serve as a tie-breaker. In my case there is no such column...
The goal is to run a scheduled query that will delete all duplicate records once a day
You can use SELECT DISTINCT to deduplicate your table.
CREATE OR REPLACE TABLE
`project.dataset.table` -- this will overwrite your table
AS
SELECT DISTINCT
record_id,
session_id
FROM
`project.dataset.table`
Replace the table to be replaced or created with a new name. This will create a new table. Running the query again will recreate/overwrite the new table.
CREATE OR REPLACE TABLE
`project.dataset.table_deduped` -- change to a new table
AS
SELECT DISTINCT
record_id,
session_id
FROM
`project.dataset.table`
Some potential drawbacks:
A view is a virtual table based on a query. The data is not stored in a new table. You query it the same way you query a table. Can be useful if the underlying table or tables gets updated and you always want an up to date 'view' of the data.
CREATE VIEW
`project.dataset.table_deduped` -- change to a new table
AS
SELECT DISTINCT
record_id,
session_id
FROM
`project.dataset.table`
Create a temporary table that you can reference inside the same query.
WITH
deduped_table AS (
SELECT DISTINCT
record_id,
session_id
FROM
`project.dataset.table`
)
SELECT
*
FROM
deduped_table
All the options have their pro's and cons. Depending on what you want to achieve, how your data is structured and how much you need to care about the amount of data your are querying/storing. If you only have the two columns it's probably not that big of a deal but worth noting your options in any case I rate.