scala, apache-spark, rdd, hadoop-partitioning

Spark mapPartitionsWithIndex : Identify a partition


Identify a partition :

mapPartitionsWithIndex(f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false)

The method applies a function to each partition of the RDD, and the function receives the partition's index as its first argument, so we can tell which partition we are processing.

Numerous examples use this method to remove the header of a data set with an "index == 0" condition. But how do we make sure that the first partition read (i.e. the one whose "index" equals 0) actually contains the header? Isn't it random, or based upon the partitioner, if one is used?


Solution

  • Isn't it random or based upon the partitioner, if used?

    It is not random: the index is the partition's position within the RDD, assigned deterministically when the data is split. When a file is read with sc.textFile, partition 0 corresponds to the first split of the file (splits follow file offsets), so it contains the first lines, including the header. You can see the index-to-element mapping with the simple example below:

    val base = sc.parallelize(1 to 100, 4)
    base.mapPartitionsWithIndex((index, iterator) => {
      // Tag every element with the index of the partition it lives in
      iterator.map { x => (index, x) }
    }).foreach { x => println(x) }
    

    Result (the partition tags are deterministic, but the print order interleaves because foreach runs on all partitions concurrently) : (0,1) (1,26) (2,51) (1,27) (0,2) (0,3) (0,4) (1,28) (2,52) (1,29) (0,5) (1,30) (1,31) (2,53) (1,32) (0,6) ... ...
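    Following the same idea, a header-removal sketch might look like the one below. The per-partition function is plain Scala, so it can be shown (and tested) on its own; the commented line shows how it would be wired into Spark, where `lines` is a hypothetical RDD obtained from sc.textFile:

    ```scala
    // Drop the first element only in partition 0. With sc.textFile, partition 0
    // holds the first split of the file, i.e. the line containing the header.
    def dropHeader(index: Int, iter: Iterator[String]): Iterator[String] =
      if (index == 0) iter.drop(1) else iter

    // Hypothetical wiring into Spark (assumes `lines` was read with sc.textFile):
    // val data = lines.mapPartitionsWithIndex(dropHeader)

    // Standalone check on plain iterators:
    val part0 = dropHeader(0, Iterator("header", "row1")).toList  // List("row1")
    val part1 = dropHeader(1, Iterator("row2", "row3")).toList    // List("row2", "row3")
    ```

    Note that this relies on the whole header line landing in partition 0, which holds for text files since input splits are ordered by file offset.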