amazon-web-services, amazon-s3, amazon-dynamodb, amazon-data-pipeline, aws-data-pipeline

AWS Data Pipeline keeps running into FileAlreadyExistsException


I basically followed this tutorial to set up a simple Data Pipeline that exports my DynamoDB table to S3.

But whenever I try to run it, it fails with: Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3://table-ddb-backup/ already exists. This makes no sense to me: I double-checked, and this bucket doesn't even exist in my AWS account, so how can it "already exist"?

Also, I've changed the bucket name to a different one, but the same error persists. Any pointers, please?

Edit: I just learned from the AWS docs that each S3 bucket name must be globally unique within its partition; I had thought it was enough for a bucket name to be unique within my own AWS account. But that still doesn't explain why this Data Pipeline job keeps failing with this error.

Thanks!


Solution

  • I figured it out by adding this field to the pipeline definition in my CDK code when provisioning my Data Pipeline:

    {
        "key": "preStepCommand",
        "stringValue": "(sudo yum -y update aws-cli) && (aws s3 rm #{output.directoryPath} --recursive)"
    },
    

    Hadoop's output committer refuses to write to an output path that already exists, so leftover objects under the S3 prefix (e.g. from a previous run or retry) trigger the FileAlreadyExistsException. The preStepCommand above deletes everything under the output path before each run, so the EMR step always starts with an empty directory.
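
    For context, here is a minimal sketch of where such a field entry sits inside a pipeline object, using the low-level key/stringValue/refValue field format that CDK's CfnPipeline accepts. The object ids, the step value, and the output reference are illustrative assumptions, not taken from the original post:

    ```json
    {
        "id": "TableBackupActivity",
        "name": "TableBackupActivity",
        "fields": [
            { "key": "type", "stringValue": "EmrActivity" },
            { "key": "runsOn", "refValue": "EmrClusterForBackup" },
            { "key": "preStepCommand", "stringValue": "(sudo yum -y update aws-cli) && (aws s3 rm #{output.directoryPath} --recursive)" },
            { "key": "output", "refValue": "S3BackupLocation" },
            { "key": "step", "stringValue": "..." }
        ]
    }
    ```

    At runtime, #{output.directoryPath} expands to the S3 path of the activity's output data node, so the aws s3 rm --recursive command clears exactly the prefix the EMR step is about to write to.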