df_CorpBond = spark.read.format("parquet").option("header", "true").load(f"/mnt/{container_name}/raw_data/dfl.corporate.parquet")

df_CorpBond.repartition(100).write \
    .format("jdbc") \
    .option("url", url_connector) \
    .option("dbtable", "MarkIt_CorpBonds") \
    .option("user", user) \
    .option("password", pwd) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .option("numPartitions", 100) \
    .option("batchsize", 100000) \
    .mode("overwrite") \
    .save()
This is my code for reading the blob files and loading them into SQL Server. The job was taking almost 2.5 hours for a data size of 2.3 GB.
I even tried a repartition value of 20, but the job still takes a long time. Sometimes the first job completes within 2 minutes, then a new job starts immediately with two stages: the first stage contains 23 tasks and the second contains 20 (based on the partition number). This second, partition-based stage is the one that takes a long time to complete.
Repartitioning into a high number of partitions can itself cause performance problems, and if the target database cannot handle that many parallel inserts effectively, it becomes the bottleneck.
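Before picking a repartition or coalesce value, it helps to see how many partitions (and therefore concurrent JDBC connections) the DataFrame would actually open. The snippet below is only a diagnostic sketch using your existing df_CorpBond; spark_partition_id is a built-in PySpark function.

# Diagnostic sketch: how many partitions does the DataFrame have, and how
# are the rows spread across them? Each partition becomes one concurrent
# JDBC connection inserting into SQL Server.
from pyspark.sql.functions import spark_partition_id, count

print(df_CorpBond.rdd.getNumPartitions())

(df_CorpBond
    .groupBy(spark_partition_id().alias("partition_id"))
    .agg(count("*").alias("rows"))
    .orderBy("partition_id")
    .show())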
Consider converting your Parquet data to Delta Lake format. Delta Lake provides optimizations and ACID transactions.
1st Approach:

# Read the Parquet file
df_CorpBond = spark.read.format("parquet").option("header", "true").load(f"/mnt/{container_name}/raw_data/dfl.corporate.parquet")

# Write the DataFrame to a Delta table
df_CorpBond.write.format("delta").mode("overwrite").save(f"/mnt/{container_name}/delta/MarkIt_CorpBonds")
Know more about Delta vs. Parquet: A Deep Dive into Big Data Storage Solutions
2nd Approach:
df_CorpBond.coalesce(10).write \
.format("jdbc") \
.option("url", jdbcUrl) \
.option("dbtable", table_name) \
.option("user", connectionProperties["user"]) \
.option("password", connectionProperties["password"]) \
.option("driver", connectionProperties["driver"]) \
.option("batchsize", 10000) \
.mode("overwrite") \
.save()
In the above code I have used coalesce() instead of repartition() to reduce the number of partitions, as coalesce() avoids a full shuffle.
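To make the difference concrete, here is a small illustrative sketch using your df_CorpBond (the values 10 and 100 are just examples): coalesce() only merges existing partitions, while repartition() shuffles the whole DataFrame.

# coalesce(10) merges existing partitions without a full shuffle;
# repartition(100) redistributes every row across 100 new partitions.
print(df_CorpBond.rdd.getNumPartitions())                    # original partition count
print(df_CorpBond.coalesce(10).rdd.getNumPartitions())       # at most 10, no shuffle
print(df_CorpBond.repartition(100).rdd.getNumPartitions())   # exactly 100, full shuffle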
Know more about PySpark Repartition() vs Coalesce()