When using insertInto to write to AWS S3, the data is first written to a .spark-staging
folder and then moved (a fake rename: S3 has no real rename, so files are copied and deleted) to the final location, 10 files at a time. Moving the staging data 10 by 10 is slow.
Is there any way to increase the parallelism of this step?
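For reference, a minimal sketch of the write pattern I mean (the table name events, the column names, and the sample data are made up for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("insertInto-s3-example")
  .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._

// insertInto matches columns by position, so the partition column goes last
val ds = Seq((1L, "2024-01-01"), (2L, "2024-01-01")).toDF("value", "date")

// assumes an existing table `events` partitioned by `date` with an S3 LOCATION;
// task output lands under .spark-staging and is moved into place at job commit
ds.write.mode("overwrite").insertInto("events")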
That is an issue with the S3 committer; check the documentation:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-optimized-commit-protocol.html
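To see what you are currently running with, a small diagnostic sketch (assuming an active SparkSession named spark) that prints the commit-related settings in effect, so you can compare them against what the linked docs describe:

// these are standard Spark SQL conf keys; "<not set>" is just a fallback label
Seq(
  "spark.sql.sources.partitionOverwriteMode",
  "spark.sql.sources.commitProtocolClass",
  "spark.sql.parquet.output.committer.class"
).foreach { key =>
  println(s"$key = ${spark.conf.get(key, "<not set>")}")
}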
You should try to avoid using
.option("partitionOverwriteMode", "dynamic")
If your data is partitioned by, e.g., date, you should include the date in the output path.
so instead of
ds.write.option("partitionOverwriteMode", "dynamic").partitionBy("date").parquet("s3://basepath/")
do
ds.write.parquet(s"s3://basepath/date=$someDate/")
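Using the same ds and spark as in the example above (bucket name and date are illustrative), writing straight to the partition directory sidesteps the dynamic-overwrite commit path, and mode("overwrite") only replaces that one date's directory, because it is the output path itself:

val someDate = "2024-01-01"

// drop the partition column from the data; its value is encoded in the path
ds.drop("date")
  .write
  .mode("overwrite")
  .parquet(s"s3://basepath/date=$someDate/")

// partition discovery still recovers `date` as a column when reading the base path
val all = spark.read.parquet("s3://basepath/")
all.printSchema()  // includes `date`, inferred from the date=... directories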