google-bigquery

Can I prevent duplicate data in bigquery?


I'm playing with BQ and I create a table and inserted some data. I reinserted it and it created duplicates. I'm sure I'm missing something, but is there something I can do to ignore it if the data exists in the table?

My use case is I get a stream of data from various clients and sometimes their data will include some data they previously already sent(I have no control on them submitting).

Is there a way to prevent duplicates when certain conditions are met? The easy one is if the entire data is the same but also if certain columns are present?


Solution

  • It's difficult to answer your question without a clear idea of the table structure, but it feels like you could be interested in the MERGE statement: ref here.

    With this DML statement you can perform a mix of INSERT, UPDATE, and DELETE statements, hence do exactly what you are describing.