scala, apache-spark, scalding

How to override setup and cleanup methods in a Spark map function


Suppose there is the following MapReduce job:

Mapper:

setup() initializes some state

map() adds data to the state, no output

cleanup() outputs the state to the context

Reducer:

aggregates all states into one output

How could such a job be implemented in Spark?

Additional question: how could such a job be implemented in Scalding? I'm looking for an example which somehow makes the method overloadings...


Solution

  • Spark's map doesn't provide an equivalent of Hadoop's setup and cleanup. It assumes that each call is independent and side-effect free.

    The closest equivalent you can get is to put the required logic inside mapPartitions or mapPartitionsWithIndex, using a simplified template (a fuller sketch is shown below):

    rdd.mapPartitions { iter =>
       ... // initialize state
       val result = ??? // compute result for iter
       ... // perform cleanup
       result // return results as an Iterator[U]
    }
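
    As a minimal sketch of how the whole job could look, assuming the per-partition state is a word-count map and the "reducer" aggregation is done with reduce; the sample data and the names perPartitionState and merged are purely illustrative:

    import org.apache.spark.SparkContext
    import scala.collection.mutable

    val sc = new SparkContext("local[*]", "setup-cleanup-sketch")
    val rdd = sc.parallelize(Seq("a b", "b c", "c a"), 2)

    // "Mapper": one state object per partition, filled by the map-like logic,
    // emitted once at the end (the cleanup() equivalent).
    val perPartitionState = rdd.mapPartitions { iter =>
      val state = mutable.Map.empty[String, Long]        // setup(): initialize state
      iter.flatMap(_.split("\\s+")).foreach { word =>    // map(): update state, no output
        state(word) = state.getOrElse(word, 0L) + 1L
      }
      Iterator(state.toMap)                              // cleanup(): emit state once
    }

    // "Reducer": aggregate all per-partition states into one output.
    val merged = perPartitionState.reduce { (a, b) =>
      (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0L) + b.getOrElse(k, 0L))).toMap
    }

    println(merged) // e.g. Map(a -> 2, b -> 2, c -> 2)
    sc.stop()

    Because mapPartitions runs its body once per partition, the initialization and the final emission happen once per partition rather than once per record, which mirrors setup() and cleanup().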