scala, apache-spark, rdd, hadoop-partitioning

Spark mapPartitionsWithIndex : Identify a partition


Identify a partition :

mapPartitionsWithIndex(f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false)

The method applies a function to each partition of the RDD, and the function receives the partition's index as its first argument, so we can tell which partition we are processing.

Numerous examples use this method to remove the header of a data set with an "index == 0" condition. But how do we make sure that the first partition read (i.e. the one whose "index" equals 0) actually contains the header? Isn't it random, or based upon the partitioner, if one is used?


Solution

  • Isn't it random or based upon the partitioner, if used?

    It is not random: the index is the partition's position within the RDD, assigned deterministically when the data is split. When a file is read with sc.textFile, partition 0 corresponds to the first split of the file (splits follow file offsets), so it contains the first lines, including the header. You can see the index-to-element mapping with the simple example below:

    val base = sc.parallelize(1 to 100, 4)
    base.mapPartitionsWithIndex((index, iterator) => {
      // Tag every element with the index of the partition it lives in
      iterator.map { x => (index, x) }
    }).foreach { x => println(x) }
    

    Result (the partition tags are deterministic, but the print order interleaves because foreach runs on all partitions concurrently) : (0,1) (1,26) (2,51) (1,27) (0,2) (0,3) (0,4) (1,28) (2,52) (1,29) (0,5) (1,30) (1,31) (2,53) (1,32) (0,6) ... ...
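    Following the same idea, a header-removal sketch might look like the one below. The per-partition function is plain Scala, so it can be shown (and tested) on its own; the commented line shows how it would be wired into Spark, where `lines` is a hypothetical RDD obtained from sc.textFile:

    ```scala
    // Drop the first element only in partition 0. With sc.textFile, partition 0
    // holds the first split of the file, i.e. the line containing the header.
    def dropHeader(index: Int, iter: Iterator[String]): Iterator[String] =
      if (index == 0) iter.drop(1) else iter

    // Hypothetical wiring into Spark (assumes `lines` was read with sc.textFile):
    // val data = lines.mapPartitionsWithIndex(dropHeader)

    // Standalone check on plain iterators:
    val part0 = dropHeader(0, Iterator("header", "row1")).toList  // List("row1")
    val part1 = dropHeader(1, Iterator("row2", "row3")).toList    // List("row2", "row3")
    ```

    Note that this relies on the whole header line landing in partition 0, which holds for text files since input splits are ordered by file offset.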