amazon-web-servicesamazon-redshiftamazon-data-pipeline

AWS DataPipeline: RedshiftCopyActivity OVERWRITE_EXISTING not enforcing primary key


I have a DataPipeline that exports data from a local DB to Redshift via S3 (very similar to Incremental copy of RDS MySQL table to Redshift template). I have defined primary key and set insertMode to "OVERWRITE_EXISTING" in pipeline definition, however, I noticed that some rows eventually were duplicated. In what cases does it happen and how do I prevent it?


Solution

  • Just found this post after several years, adding an answer in case it helps someone else:

    In addition to primary keys Redshift also uses distkeys to determine which lines to overwrite. So in my case an updated value in distkey column forced Redshift to create a duplicate row, although the primary key remained unchanged.