sqlsnowflake-cloud-data-platformsql-merge

ingore duplicate in merge SQL on Snowflake


I have two tables:

src:

ID model accy delivery_date ETA call_off department style duration plant
123abc xxyy MM 2022-12-14T00:00:00.000Z 2022-10-20T00:00:00.000Z 2023-01-17T00:00:00.000Z paint pink 3.3 dd
123abc xxyy MM 2022-12-14T00:00:00.000Z 2022-10-20T00:00:00.000Z 2022-10-20T00:00:00.000Z paint pink 3.3 dd
123abc xxyy MM 2022-12-14T00:00:00.000Z 2022-10-20T00:00:00.000Z 2023-02-06T00:00:00.000Z paint pink 3.3 dd

dest:

ID model accy delivery_date ETA call_off department style duration plant
123abc xxyy MM 2022-12-14T00:00:00.000Z 2022-10-20T00:00:00.000Z 2023-01-17T00:00:00.000Z paint pink 3.3 dd

the dest has this row which is the same as the first row in the src table.

whenever I try to merge I get this error:

100090 (42P18): Duplicate row detected during DML action
Row Values: ["123abc", "xxyy", "MM", "2022-12-14T00:00:00.000Z", "2022-10-20T00:00:00.000Z", "2023-01-17T00:00:00.000Z", "paint", "pink", 3.3, "dd"]

My query:

merge into cleaned_delivery dest
using ( 
SELECT distinct
  RECORD_CONTENT:ID::varchar AS ID,
  RECORD_CONTENT:model::varchar AS model,
  RECORD_CONTENT:accy::varchar AS accy,
  RECORD_CONTENT:delivery_date::varchar AS delivery_date,
  RECORD_CONTENT:ETA::varchar AS ETA,
  RECORD_CONTENT:call_off::varchar AS call_off,
  RECORD_CONTENT:department::varchar AS department,
  RECORD_CONTENT:style::varchar AS style, 
  RECORD_CONTENT:duration::float AS duration,    
  RECORD_CONTENT:plant::varchar AS plant

FROM raw_delivery where RECORD_CONTENT:ID

) as src (ID, model, accy, delivery_date, ETA, call_off, department, style, duration, plant)
on dest.ID = src.ID and dest.model = src.model and dest.accy = src.accy 
when matched
then update set
  dest.delivery_date = src.delivery_date,
  dest.ETA = src.ETA,
  dest.call_off = src.call_off,
  dest.department = src.department,
  dest.style = src.style,
  dest.duration = src.duration
when not matched then insert (ID, model, accy, delivery_date, ETA, call_off, department, style, duration, plant)
  values (ID, model, accy, delivery_date, ETA, call_off, department, style, duration, plant);

I tried adding this line after the when matched:

when matched and (src.delivery_date != dest.delivery_date or src.ETA!= dest.ETA or src.call_off!= dest.call_off)

But still the same error. How can I merge these two tables and ignore rows that already exists (duplicate)


Solution

  • It is possible to allow nondeterministic MERGE by setting ERROR_ON_NONDETERMINISTIC_MERGE

    ALTER SESSION SET ERROR_ON_NONDETERMINISTIC_MERGE = FALSE;
    

    Hiding the fact that source provides multiple matching rows is not recommended course of action. A better way is to explicitly state which row should be taken - unique per columns defined on on dest.ID = src.ID and dest.model = src.model and dest.accy = src.accy :

    merge into cleaned_delivery dest
    using ( 
    SELECT
      RECORD_CONTENT:ID::varchar AS ID,
      RECORD_CONTENT:model::varchar AS model,
      RECORD_CONTENT:accy::varchar AS accy,
      RECORD_CONTENT:delivery_date::varchar AS delivery_date,
      RECORD_CONTENT:ETA::varchar AS ETA,
      RECORD_CONTENT:call_off::varchar AS call_off,
      RECORD_CONTENT:department::varchar AS department,
      RECORD_CONTENT:style::varchar AS style, 
      RECORD_CONTENT:duration::float AS duration,    
      RECORD_CONTENT:plant::varchar AS plant  
    FROM raw_delivery 
    QUALIFY ROW_NUMBER() OVER(PARTITION BY ID, model, accy 
                              ORDER BY <here_goes_columns>) = 1  
    ) AS src
     on dest.ID = src.ID 
    and dest.model = src.model 
    and dest.accy = src.accy