We have a Data Pipeline job that does a nightly copy of our DynamoDB table to S3 buckets so we can run reports on the data with Athena. Occasionally the pipeline fails with a 503 SlowDown error. The retries usually "succeed" but create tons of duplicate records in S3. The DynamoDB table uses on-demand capacity mode and the pipeline's myDDBReadThroughputRatio is 0.5. A couple of questions here:
I assume reducing myDDBReadThroughputRatio would lessen the problem. If so, does anyone have a good ratio that is still performant but doesn't cause these errors?
Is there a way to prevent the duplicate records in S3? I can't figure out why they are being generated (possibly the records from the failed run are not removed?).
Of course, any other thoughts/solutions for the problem would be greatly appreciated.
Thanks!
Using AWS Data Pipeline for continuous backups is not recommended.
AWS recently launched new functionality that lets you export DynamoDB table data to S3, which can then be analysed with Athena. Check it out here.
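For reference, here is a minimal sketch of kicking off that native export with boto3 (it requires point-in-time recovery to be enabled on the table; the table ARN, bucket, and prefix below are placeholders):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Start a native DynamoDB -> S3 export. The ARN, bucket, and prefix are placeholders.
response = dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/MyTable",
    S3Bucket="my-export-bucket",
    S3Prefix="dynamodb-exports/nightly/",
    ExportFormat="DYNAMODB_JSON",
)

# The export runs asynchronously; poll describe_export if you need to wait for completion.
print(response["ExportDescription"]["ExportStatus"])  # typically "IN_PROGRESS" at first
```

The nice part is that the export reads from the table's point-in-time backup rather than the live table, so it consumes no read capacity and the throughput-ratio question goes away entirely.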
You can also use AWS Glue to do the same (link).
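A rough sketch of what the Glue route could look like as a PySpark Glue job (the table name, read percentage, and output path are assumptions you would swap for your own):

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the DynamoDB table, capping scan throughput at a fraction of table capacity
# (the same idea as myDDBReadThroughputRatio in Data Pipeline).
table = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "MyTable",       # placeholder table name
        "dynamodb.throughput.read.percent": "0.5",   # placeholder read ratio
    },
)

# Write to S3 as Parquet, which Athena can query directly and scans more cheaply
# than raw JSON/CSV dumps.
glue_context.write_dynamic_frame.from_options(
    frame=table,
    connection_type="s3",
    connection_options={"path": "s3://my-export-bucket/glue-export/"},  # placeholder path
    format="parquet",
)

job.commit()
```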
If you still want to continue using Data Pipeline, the issue seems to be S3 request-rate limits being reached (that is what the 503 SlowDown error indicates). You might need to check whether other processes are also writing to the same bucket/prefix at the same time, or whether you can limit the request rate from the pipeline through its configuration.
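One way to check for other writers hitting the bucket during the pipeline window (a sketch that assumes you have enabled S3 request metrics on the bucket, here with the default "EntireBucket" filter, and a placeholder bucket name) is to pull the PUT and 5xx counts from CloudWatch:

```python
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch")

def s3_request_stats(bucket, metric, hours=6):
    """Sum a per-bucket S3 request metric over the last `hours` hours in 5-minute buckets."""
    end = datetime.utcnow()
    return cloudwatch.get_metric_statistics(
        Namespace="AWS/S3",
        MetricName=metric,  # e.g. "PutRequests" or "5xxErrors"
        Dimensions=[
            {"Name": "BucketName", "Value": bucket},
            {"Name": "FilterId", "Value": "EntireBucket"},  # assumes the default request-metrics filter
        ],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=300,
        Statistics=["Sum"],
    )["Datapoints"]

# Placeholder bucket name; compare PUT volume against the 5xx (SlowDown) spikes.
for metric in ("PutRequests", "5xxErrors"):
    for point in sorted(s3_request_stats("my-export-bucket", metric), key=lambda p: p["Timestamp"]):
        print(metric, point["Timestamp"], int(point["Sum"]))
```

If PUTs to a single prefix are spiking into the thousands per second right when the 5xx errors appear, that points at the S3 request-rate limit rather than anything specific to the pipeline itself.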