We have a table with about 1.355 billion rows and 20 columns.
We want to join it with another table that has more or less the same number of rows.
How do we decide the value for spark.conf.set("spark.sql.shuffle.partitions", ?)?
How do we decide the number of executors and their resource allocation (cores and memory)?
How can we estimate how much memory those 1.355 billion rows will take?
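For a sense of scale, a very rough back-of-envelope estimate (the ~8 bytes per column figure is only a guess; actual size depends heavily on data types, string lengths, and how Spark caches the data):

    rows = 1_355_000_000
    columns = 20
    avg_bytes_per_column = 8  # assumption: mostly numeric columns; strings can be far larger
    total_bytes = rows * columns * avg_bytes_per_column
    print(f"~{total_bytes / 1024**3:.0f} GiB uncompressed")  # roughly 200 GiB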
As @samkart says, you have to experiment to find the best parameters, since they depend on the size and nature of your data. The Spark tuning guide would be helpful.
Here are some settings you may want to tweak (a sketch that puts them together follows this list):

- spark.executor.cores is 1 by default (on YARN) but you should look to increase it to improve parallelism; a common rule of thumb is 5 cores per executor.
- spark.sql.files.maxPartitionBytes determines the amount of data per partition when reading files, and hence the initial number of partitions. The default is 128 MB, matching the HDFS block size; tweak it depending on your data size.
- spark.sql.shuffle.partitions is 200 by default; tune it based on the amount of data being shuffled and the total number of cores. This blog would be helpful.
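To make that concrete, here is a minimal PySpark sketch of how these settings might be wired together. All of the numbers (executor count, memory, estimated shuffle size, target partition size) are assumptions for illustration, not recommendations, and estimated_shuffle_gb / target_partition_mb are hypothetical helper variables, not Spark settings.

    from pyspark.sql import SparkSession

    # All numbers below are assumptions for illustration; tune them for your cluster.
    # Executor settings usually need to be set at submit time (spark-submit / job config),
    # not after the session already exists.
    spark = (
        SparkSession.builder
        .appName("large-join-tuning")
        .config("spark.executor.instances", "20")   # assumed executor count
        .config("spark.executor.cores", "5")        # ~5 cores per executor rule of thumb
        .config("spark.executor.memory", "20g")     # assumed; leave headroom for overhead
        # Size of input partitions for file-based DataFrame reads (default 128 MB).
        .config("spark.sql.files.maxPartitionBytes", str(256 * 1024 * 1024))
        .getOrCreate()
    )

    # Derive shuffle partitions from an estimated shuffle size and a target partition size.
    estimated_shuffle_gb = 400   # rough size of data shuffled by the join (assumption)
    target_partition_mb = 200    # aim for roughly 100-200 MB per shuffle partition
    shuffle_partitions = int(estimated_shuffle_gb * 1024 / target_partition_mb)

    spark.conf.set("spark.sql.shuffle.partitions", str(shuffle_partitions))

The general idea is to keep shuffle partitions around 100-200 MB each and, ideally, a multiple of the total core count, then refine after checking task sizes and spill in the Spark UI.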