I need to assign a unique integer ID to each Spark executor in a Spark application, and I need to retrieve that ID from within a task running on the executor. The executor ID will be combined with other data elements (timestamp, MAC address, etc.) to generate unique 64-bit keys. How can I assign a unique integer key to every Apache Spark executor within an Apache Spark Java application?
The partition index might be useful here: all elements of a single partition are processed by the same task, and therefore on the same executor.
mapPartitionsWithIndex passes that index to your function:
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.master("local[*]").appName("partitionIndex").getOrCreate()
    import spark.implicits._

    // The numbers 1 to 20, spread over 4 partitions
    val ds = spark.createDataset(Seq.range(1, 21)).repartition(4)
    ds.rdd
      .mapPartitionsWithIndex((partitionIndex, it) => {
        // Called once per partition, with the index of that partition
        println("processing partition " + partitionIndex)
        it.map(i => "partition " + partitionIndex + " contains number " + i)
      })
      .foreach(println)
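This prints (the order of the lines can differ between runs, since the partitions are processed in parallel):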
    processing partition 1
    processing partition 0
    processing partition 2
    processing partition 3
    partition 1 contains number 3
    partition 2 contains number 4
    partition 2 contains number 9
    partition 2 contains number 14
    partition 2 contains number 19
    partition 0 contains number 2
    ...
    partition 3 contains number 1
    partition 3 contains number 5
    ...
If you can give every row within a partition an ID that is unique within that partition (a running counter, for example), then the combination of that ID and the partition index is unique across all partitions of the dataset.
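Since the question asks for Java and 64-bit keys, here is a minimal sketch of how such a combination could be packed into a single long. The class name PartitionKey and the 23/40 bit split are assumptions made for illustration, not anything Spark prescribes; adjust the split if you also need room for a timestamp or other fields.

```java
// Hypothetical helper: packs a partition index and a per-partition counter into one 64-bit key.
final class PartitionKey {
    // Assumed layout: sign bit unused, 23 bits partition index, 40 bits counter.
    static final int COUNTER_BITS = 40;
    static final long MAX_COUNTER = (1L << COUNTER_BITS) - 1;
    static final int MAX_PARTITION = (1 << (63 - COUNTER_BITS)) - 1;

    private PartitionKey() {}

    // Combine the two parts; rejects values that would not fit into their bit range.
    static long pack(int partitionIndex, long counter) {
        if (partitionIndex < 0 || partitionIndex > MAX_PARTITION) {
            throw new IllegalArgumentException("partition index out of range: " + partitionIndex);
        }
        if (counter < 0 || counter > MAX_COUNTER) {
            throw new IllegalArgumentException("counter out of range: " + counter);
        }
        return ((long) partitionIndex << COUNTER_BITS) | counter;
    }

    // Recover the partition index from a packed key.
    static int partitionOf(long key) {
        return (int) (key >>> COUNTER_BITS);
    }

    // Recover the per-partition counter from a packed key.
    static long counterOf(long key) {
        return key & MAX_COUNTER;
    }
}
```

In a Java Spark job, the function passed to JavaRDD.mapPartitionsWithIndex would keep a local counter starting at 0 and call PartitionKey.pack(partitionIndex, counter++) for each row. Two rows can then only collide if they share both the partition index and the counter value, which cannot happen within one pass over the partition.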