apache-spark, amazon-s3, sql-insert

Spark InsertInto Slow Renaming in AWS S3


When using insertInto with AWS S3, the data is first written to a .spark-staging folder and then moved (a "rename" that S3 implements as copy-and-delete) to the final location, in batches of 10 files at a time. Moving the staged data 10 files at a time is slow.

Is there any way to increase its parallelism?
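For context, a minimal sketch of the kind of write being discussed; the table name events, the input path, and the dynamic-overwrite setting are assumptions, not details from the question:

    import org.apache.spark.sql.SparkSession

    object InsertIntoS3 {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("insertInto-s3")
          // Dynamic partition overwrite is the mode under which Spark
          // writes to a .spark-staging dir and then moves files into place.
          .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
          .getOrCreate()

        val ds = spark.read.parquet("s3://some-bucket/input/") // hypothetical input

        // On S3 the final move out of .spark-staging is a copy + delete,
        // done in small batches, which is what makes the commit slow.
        ds.write.mode("overwrite").insertInto("events")
      }
    }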


Solution

  • This is an issue with the S3 committer; see the documentation:

    https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-optimized-commit-protocol.html

    You should avoid using .option("partitionOverwriteMode", "dynamic").

    If your data is partitioned by date, for example, include the date in the output path.

    So instead of:

    ds.write.mode("overwrite").option("partitionOverwriteMode", "dynamic").partitionBy("date").parquet("s3://basepath/")
    

    do:

    ds.write.mode("overwrite").parquet(s"s3://basepath/date=$someDate/")
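
    A fuller sketch of this direct-path write, assuming a date-partitioned dataset; basepath, the input path, and someDate are placeholders:

    import org.apache.spark.sql.{SaveMode, SparkSession}
    import org.apache.spark.sql.functions.col

    object DirectPartitionWrite {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("direct-partition-write")
          .getOrCreate()

        val ds = spark.read.parquet("s3://basepath-in/") // hypothetical input
        val someDate = "2024-01-01"                      // hypothetical partition value

        // Keep only the rows for this partition and write them straight
        // into the final Hive-style directory. Overwrite mode replaces
        // just this path, so there is no staging dir and no S3 "rename".
        ds.filter(col("date") === someDate)
          .drop("date") // the value is carried by the directory name
          .write
          .mode(SaveMode.Overwrite)
          .parquet(s"s3://basepath/date=$someDate/")
      }
    }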