I have a performance problem with the repartition and partitionBy operations in Spark. My df contains one month of data, and I am partitioning it by day using the dailyDt column. My code is below.
This takes 3 minutes to finish, but it produces many small files for each dailyDt partition.
import org.apache.spark.sql.SaveMode.Overwrite

df.repartition(600)
.write
.partitionBy("dailyDt")
.mode(Overwrite)
.parquet("/path..")
This produces only 1 big file for each day, so it's not the solution.
df.repartition(20, $"dailyDt")
.write
.partitionBy("dailyDt")
.mode(Overwrite)
.parquet("/path..")
Adding salt with the rand function, I get 20 files per day, each of the same size, as expected, but it takes too much time to run.
import org.apache.spark.sql.functions.rand
df.repartition(20, $"dailyDt", rand)
.write
.partitionBy("dailyDt")
.mode(Overwrite)
.parquet("/path..")
So I have a solution, but it runs for a long time. How can I decrease the execution time?
Fixed it with repartitionByRange.
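A minimal sketch of how this can look (assuming roughly 30 days of data, reusing the 600 partition count from the first snippet; the salt column name is arbitrary):

import org.apache.spark.sql.SaveMode.Overwrite
import org.apache.spark.sql.functions.rand

// Range-partition on (dailyDt, salt): each day's rows land in a small run of
// contiguous partitions, so the write keeps 600-way parallelism while each
// task writes into only one (or two) day directories, roughly 20 files per day.
df.withColumn("salt", rand())
  .repartitionByRange(600, $"dailyDt", $"salt")
  .drop("salt")
  .write
  .partitionBy("dailyDt")
  .mode(Overwrite)
  .parquet("/path..")

Unlike hash repartitioning on (dailyDt, rand), range partitioning keeps each day's rows together instead of scattering them across partitions, so the partition count can go back up to 600 for parallelism without partitionBy splitting every task's output into a file per day.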