performancescalaapache-sparkrdd

Apache Spark: map vs mapPartitions?


What's the difference between an RDD's map and mapPartitions method? And does flatMap behave like map or like mapPartitions? Thanks.

(edit) i.e. what is the difference (either semantically or in terms of execution) between

  def map[A, B](rdd: RDD[A], fn: (A => B))
               (implicit a: Manifest[A], b: Manifest[B]): RDD[B] = {
    rdd.mapPartitions({ iter: Iterator[A] => for (i <- iter) yield fn(i) },
      preservesPartitioning = true)
  }

And:

  def map[A, B](rdd: RDD[A], fn: (A => B))
               (implicit a: Manifest[A], b: Manifest[B]): RDD[B] = {
    rdd.map(fn)
  }

Solution

  • What's the difference between an RDD's map and mapPartitions method?

    The method map converts each element of the source RDD into a single element of the result RDD by applying a function. mapPartitions converts each partition of the source RDD into multiple elements of the result (possibly none).

    And does flatMap behave like map or like mapPartitions?

    Neither, flatMap works on a single element (as map) and produces multiple elements of the result (as mapPartitions).