I have a DataPipeline that exports data from a local DB to Redshift via S3 (very similar to Incremental copy of RDS MySQL table to Redshift template). I have defined primary key and set insertMode to "OVERWRITE_EXISTING" in pipeline definition, however, I noticed that some rows eventually were duplicated. In what cases does it happen and how do I prevent it?
Just found this post after several years, adding an answer in case it helps someone else:
In addition to primary keys Redshift also uses distkeys to determine which lines to overwrite. So in my case an updated value in distkey column forced Redshift to create a duplicate row, although the primary key remained unchanged.