scala apache-spark distributed-computing partitioning executors

Spark application uses only 1 executor


I am running an application with the following code. I don't understand why only 1 executor is in use even though I have 3. When I try to increase the range, my job fails because the task manager loses an executor. In the summary I see a value for shuffle writes, but shuffle reads are 0 (maybe because all the data is on one node and no shuffle read needs to happen to complete the job).

import org.apache.spark.rdd.RDD

// sortByKeyWithPartition and partitioner are custom and defined elsewhere in the application
val rdd: RDD[(Int, Int)] = sc.parallelize((1 to 10000000).map(k => k -> 1).toSeq)
val rdd2 = rdd.sortByKeyWithPartition(partitioner = partitioner)
val sorted = rdd2.map(_._1)
val count_sorted = sorted.collect()
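
To check whether everything really ends up in a single partition, here is a quick diagnostic sketch (assuming the same SparkContext `sc` and the `rdd` defined above):

// Quick diagnostic (assumes the same SparkContext `sc` and `rdd` as above).
println(s"defaultParallelism = ${sc.defaultParallelism}")
println(s"numPartitions      = ${rdd.getNumPartitions}")

// Count the records in each partition; a single non-empty partition
// means only one task (and hence one executor) has any work to do.
rdd.mapPartitionsWithIndex((idx, it) => Iterator(idx -> it.size))
   .collect()
   .foreach { case (idx, n) => println(s"partition $idx: $n records") }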

Edit: I increased the executor and driver memory and cores, and changed the number of executors from 4 to 1. That seems to have helped; I now see shuffle reads/writes on each node.
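
For reference, a sketch of how those settings can be applied when building the context; the values are placeholders, and `spark.executor.instances` only takes effect on a cluster manager that honors it (e.g. YARN or Kubernetes):

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder values for illustration only; tune them for your cluster.
val conf = new SparkConf()
  .setAppName("sort-example")            // hypothetical app name
  .set("spark.executor.instances", "3")  // number of executors
  .set("spark.executor.cores", "2")      // cores per executor
  .set("spark.executor.memory", "4g")    // memory per executor
  .set("spark.driver.memory", "2g")      // driver memory

val sc = new SparkContext(conf)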


Solution

  • ..maybe because all the data is on one node

    That should make you suspect that your RDD has only one partition, instead of 3 or more partitions that would eventually utilize all of the executors.

    So, building on Hokam's answer, here is what I would do:

    rdd.getNumPartitions
    

    Now if that is 1, then repartition your RDD, like this:

    val repartitioned = rdd.repartition(3)
    

    which will redistribute your data across 3 partitions.

    Try executing your code again now.
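
    Putting it together, here is a minimal sketch of the pipeline with explicit partitioning. It substitutes the built-in sortByKey for the custom sortByKeyWithPartition/partitioner from the question, purely so the example is self-contained:

    // Illustrative only: names and partition counts are placeholders, and the
    // custom sortByKeyWithPartition from the question is replaced by the
    // built-in sortByKey so this compiles on its own.
    val data = sc.parallelize((1 to 10000000).map(k => k -> 1), numSlices = 3) // 3 partitions up front

    // Or, if the RDD already exists with a single partition:
    // val data = rdd.repartition(3)

    val sorted = data
      .sortByKey(ascending = true, numPartitions = 3) // shuffle across 3 partitions
      .map(_._1)

    println(sorted.getNumPartitions) // expect 3

    Setting the partition count at creation time avoids the extra shuffle that repartition would otherwise introduce.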