java apache-spark key-generator

How can I assign a unique integer key to every Apache Spark Executor within an Apache Spark Java Application?


I need to assign a unique integer ID to each Spark executor in a Spark application, and I need to retrieve that ID from within a task running on the executor. The executor ID will be combined with other data elements (timestamp, MAC address, etc.) to generate unique 64-bit keys. How can I do this in a Java application?


Solution

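  • The index of the partition might be useful: each partition is processed by a single task, so all elements of one partition are always handled on the same executor. Note that one executor can process several partitions, so the index identifies the partition rather than the executor itself.

    mapPartitionsWithIndex exposes this index (example in Scala):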

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.master("local[*]").appName("partitionIndex").getOrCreate()
    import spark.implicits._

    // 20 numbers spread over 4 partitions
    val ds = spark.createDataset(Seq.range(1, 21)).repartition(4)
    ds.rdd
      .mapPartitionsWithIndex((partitionIndex, it) => {
        // called once per partition, inside the task that processes it
        println("processing partition " + partitionIndex)
        it.toList.map(i => s"partition $partitionIndex contains number $i").iterator
      })
      .foreach(println)
    

    This prints (the order varies between runs):

    processing partition 1
    processing partition 0
    processing partition 2
    processing partition 3
    partition 1 contains number 3
    partition 2 contains number 4
    partition 2 contains number 9
    partition 2 contains number 14
    partition 2 contains number 19
    partition 0 contains number 2
    ...
    partition 3 contains number 1
    partition 3 contains number 5
    ...
    

    If you can give every row within a partition an id that is unique inside that partition (a simple counter is enough), then the combination of that id and the partition index is unique across all partitions of the dataset.
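Since the question asks for 64-bit keys in Java, here is a minimal sketch of how a partition index and a per-partition counter could be packed into one long. The class name PartitionKeys, its methods, and the 23/40 bit split are assumptions made for illustration; they are not part of the answer above or of the Spark API, and the split should be adjusted to the expected number of partitions and rows.

```java
// Hypothetical helper: packs (partitionIndex, sequence) into one non-negative long.
// Assumed layout: sign bit unused | 23 bits partition index | 40 bits sequence.
final class PartitionKeys {
    static final int SEQUENCE_BITS = 40;
    static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;
    static final int MAX_PARTITION = (1 << (63 - SEQUENCE_BITS)) - 1;

    private PartitionKeys() {}

    // Combines the two parts; rejects values that would overflow their bit range.
    static long makeKey(int partitionIndex, long sequence) {
        if (partitionIndex < 0 || partitionIndex > MAX_PARTITION) {
            throw new IllegalArgumentException("partition index out of range: " + partitionIndex);
        }
        if (sequence < 0 || sequence > MAX_SEQUENCE) {
            throw new IllegalArgumentException("sequence out of range: " + sequence);
        }
        return ((long) partitionIndex << SEQUENCE_BITS) | sequence;
    }

    // Recovers the partition index from a key built by makeKey.
    static int partitionOf(long key) {
        return (int) (key >>> SEQUENCE_BITS);
    }

    // Recovers the per-partition sequence number from a key built by makeKey.
    static long sequenceOf(long key) {
        return key & MAX_SEQUENCE;
    }
}
```

Inside a mapPartitionsWithIndex function, the task would call PartitionKeys.makeKey(partitionIndex, counter++) for each row, with counter starting at 0 for every partition. Two rows can then only collide if they share both the partition index and the counter value, which cannot happen within a single job over the same dataset. Keys from different jobs or datasets can repeat, so uniqueness beyond one job would need extra components such as the timestamp mentioned in the question.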