apache-spark, apache-spark-sql, spark-core

How to configure the number of partitions so it does not exceed the available cores?


I am looking for a way to partition every DataFrame in my application according to the number of available cores. If my available core count (number of executors * number of cores per executor) is 20, then I want every DataFrame to be repartitioned to 20 partitions.

The only way I can see to repartition a DataFrame is df.repartition(20), but I want to apply this to all DataFrames in my application without having to write df.repartition(20) for each one.

Changing the spark.default.parallelism configuration does not work, since it only applies when you work with the RDD (lower-level) API, not with DataFrames.

Any suggestions on how to handle this?


Solution

  • If you are using the DataFrame/Dataset API, you can set the default number of shuffle partitions with this configuration directive:

    spark.sql.shuffle.partitions
    

    You can read more about this configuration option on the Performance Tuning page.

    With this configuration option set, any transformation that triggers a shuffle will automatically repartition the data to this number of partitions.
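
    For illustration, here is a minimal Scala sketch of this approach. The app name, the example data, and the value 20 are assumptions chosen to match the question; the key point is that the setting can be applied once at session creation (or changed at runtime) and then covers every shuffle in the application.

    ```scala
    import org.apache.spark.sql.SparkSession

    object ShufflePartitionsExample {
      def main(args: Array[String]): Unit = {
        // Set the default number of shuffle partitions once, at session creation.
        // The value 20 here stands in for "number of executors * cores per executor".
        val spark = SparkSession.builder()
          .appName("shuffle-partitions-example") // assumed app name
          .master("local[*]")
          .config("spark.sql.shuffle.partitions", "20")
          .getOrCreate()

        import spark.implicits._

        // Small made-up DataFrame, just to demonstrate the effect.
        val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

        // groupBy triggers a shuffle, so the result ends up with 20 partitions
        // without any explicit repartition() call (assuming adaptive query
        // execution does not coalesce them in Spark 3.x).
        val aggregated = df.groupBy("key").count()
        println(aggregated.rdd.getNumPartitions)

        // The setting can also be changed at runtime for an existing session.
        spark.conf.set("spark.sql.shuffle.partitions", "40")

        spark.stop()
      }
    }
    ```

    Since the option is read by every shuffle, this is effectively the "apply it everywhere" behaviour the question asks for, without touching individual DataFrames.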