Tags: java, scala, mapreduce, cascading, scalding

Scalding 'multiple map()' optimization


Are the following two code blocks equivalent in terms of performance?

val input: TypedPipe[Person] = ....
input
  .map(_.getName)
  .map(_.split(" "))

and...

val input: TypedPipe[Person] = ....
input
  .map(_.getName.split(" "))

Specifically, is Scalding going to optimize the code and execute a single map-only job for both of the snippets above at all times? What if the map functions are far more complex than getName/split?

IMO (and especially for far more complex map functions) the first example is more readable. However, I'm concerned that it might result in less efficient runtime execution.


Solution

  • The two functions won't be collapsed at the bytecode / scalac layer, but more importantly Scalding will always collapse them into a single map task in Hadoop. In fact, all of your map-like operators (map, flatMap, filter, etc.) will be collapsed into one map task, or even into the end of a reduce task.

    So your two examples will produce the same DAG in Hadoop, the only difference being some extra function-call overhead.

    It is very unlikely that the overhead of calling these functions separately will be a performance bottleneck compared to the serialization / deserialization and IO going on in your Scalding job. It's also possible that the HotSpot VM will JIT some of this into native instructions anyway.

    I'd definitely recommend going for readability, unless you've done extensive profiling and found this to be a bottleneck (I'd be very surprised).
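    To see why the two snippets compute the same thing, here's a plain-Scala sketch (no Scalding dependency, using List in place of TypedPipe) showing that two chained .map calls produce the same result as one map with the fused function. Scalding's planner performs the analogous fusion at the Hadoop task level. The Person case class and sample names are made up for illustration.

    ```scala
    object MapFusionSketch {
      // Stand-in for the question's Person type.
      case class Person(name: String)

      val input = List(Person("Ada Lovelace"), Person("Alan Turing"))

      // Two chained map stages, as in the first snippet.
      val twoMaps: List[List[String]] = input.map(_.name).map(_.split(" ").toList)

      // A single fused stage, as in the second snippet.
      val oneMap: List[List[String]] = input.map(_.name.split(" ").toList)
    }
    ```

    On a List the two versions traverse the data twice vs. once, but in Scalding both compile to a single map task, so only the per-element function-call overhead differs.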