Suppose there is the following MapReduce job (a rough code sketch follows below):
Mapper:
setup() initializes some state
map() adds data to the state, produces no output
cleanup() outputs the state to the context
Reducer:
aggregates all the states into one output
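In Hadoop terms, a minimal sketch of the Mapper could look like this (the key/value types and the concrete state are only placeholders for illustration):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

// Sketch only: state is initialized in setup(), updated in map() without
// emitting anything, and written to the context exactly once in cleanup().
class StatefulMapper extends Mapper[LongWritable, Text, Text, LongWritable] {

  private var state: Long = 0L

  override def setup(
      context: Mapper[LongWritable, Text, Text, LongWritable]#Context): Unit = {
    state = 0L // initialize some state
  }

  override def map(key: LongWritable, value: Text,
      context: Mapper[LongWritable, Text, Text, LongWritable]#Context): Unit = {
    state += value.getLength // add data to state, no output
  }

  override def cleanup(
      context: Mapper[LongWritable, Text, Text, LongWritable]#Context): Unit = {
    context.write(new Text("partial"), new LongWritable(state)) // output state once
  }
}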
How could such a job be implemented in Spark?
Additional question: how could such a job be implemented in Scalding? I'm looking for an example which somehow makes the method overloadings...
Spark's map doesn't provide an equivalent of Hadoop's setup and cleanup. It assumes that each call is independent and side-effect free.
The closest equivalent you can get is to put the required logic inside mapPartitions or mapPartitionsWithIndex, using a simplified template like this:
rdd.mapPartitions { iter =>
  ...              // initialize state
  val result = ??? // compute the result for iter
  ...              // perform cleanup
  ...              // return the results as an Iterator[U]
}
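For example, here is a minimal sketch that mirrors the job above, under the assumption that the state is just a running Long and the input is an RDD[String] (the name perPartition is only a placeholder):

val rdd: org.apache.spark.rdd.RDD[String] = ??? // the input data

// One state value per partition: "setup" before the loop, "map" while
// consuming the iterator, "cleanup"/output once the iterator is exhausted.
val perPartition = rdd.mapPartitions { iter =>
  var state = 0L                             // initialize state
  iter.foreach(line => state += line.length) // add data to state, no output
  Iterator.single(state)                     // output the state once per partition
}

// The reducer part: aggregate all partition states into one output.
val result = perPartition.reduce(_ + _)

Note that the iterator is consumed eagerly here; if you return a lazily computed iterator instead, any cleanup logic has to be deferred until that iterator is actually exhausted.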