apache-spark, spark-shell

Does the Spark Shell JDBC read numPartitions value depend on the number of executors?


I have Spark set up in standalone mode on a single node with 2 cores and 16 GB of RAM to run some rough POCs.
I want to load data from a SQL source using val df = spark.read.format("jdbc")...option("numPartitions", n).load(). When I measured the time taken to read a table for different numPartitions values by calling df.rdd.count, the time was the same regardless of the value I gave. I also noticed on the context web UI that the number of active executors was 1, even though I set SPARK_WORKER_INSTANCES=2 and SPARK_WORKER_CORES=1 in my spark-env.sh file.
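
Roughly what I'm running (the JDBC URL, table name and credentials below are placeholders; n is the numPartitions value being varied):

    // hypothetical connection details
    val n = 4                                   // tried several values here
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")
      .option("dbtable", "my_table")
      .option("user", "user")
      .option("password", "password")
      .option("numPartitions", n)
      .load()

    // force a full read and time it
    val t0 = System.nanoTime()
    df.rdd.count()
    println(s"read took ${(System.nanoTime() - t0) / 1e9} s")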

I have two questions:

1. Does the number of partitions actually created depend on the number of executors?
2. How do I start spark-shell with multiple executors in my current setup?

Thanks!


Solution

  • The number of partitions doesn't depend on your number of executors. There is a best practice (a certain number of partitions per core), but the partition count is not determined by the number of executor instances.

    In the case of reading from JDBC, to parallelize the read you need a partition column, e.g.:

    spark.read.format("jdbc")
      .option("url", url)
      .option("dbtable", "table")
      .option("user", user)
      .option("password", password)
      .option("numPartitions", numPartitions)
      .option("partitionColumn", "<partition_column>")
      .option("lowerBound", 1)
      .option("upperBound", 10000)
      .load()
    

    That will split the read into numPartitions parallel queries against the database, each returning roughly 10,000 / numPartitions rows.
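
    As a hypothetical illustration of that split (assuming the options above with lowerBound = 1, upperBound = 10000, numPartitions = 4 and a partition column named id), Spark issues one range-bounded query per partition, roughly:

    // approximate per-partition predicates generated by Spark:
    //   id < 2500 OR id IS NULL
    //   id >= 2500 AND id < 5000
    //   id >= 5000 AND id < 7500
    //   id >= 7500
    // and, assuming df is the DataFrame returned by load() above,
    // the partition count reflects numPartitions:
    df.rdd.getNumPartitions   // 4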

    As for your second question, you can find all of the Spark configuration options here: https://spark.apache.org/docs/latest/configuration.html (e.g. spark2-shell --num-executors, or the configuration --conf spark.executor.instances).
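
    For example, a launch command along these lines (the master URL and resource sizes are placeholders, and which of these flags actually takes effect depends on your cluster manager):

    # hypothetical standalone master URL; adjust memory and cores to your machine
    spark-shell \
      --master spark://localhost:7077 \
      --num-executors 2 \
      --executor-cores 1 \
      --executor-memory 6g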

    Note that specifying the number of executors explicitly means dynamic allocation will be turned off, so be aware of that.