Is sort part of shuffle in MapReduce?


The process by which the system sorts the map output on the map side is known as the sort. Is this part of the shuffle? In other words, when does the shuffle start: after the map output has been written to disk, or after it has been written to the in-memory buffer?


Solution

  • The whole MapReduce process is explained in detail here: http://ercoppa.github.io/HadoopInternals/AnatomyMapReduceJob.html

    To answer your question, a single map task consists of the following steps:

    [Figure: Single MapTask lifecycle]

    The EXECUTION and SPILLING phases run in parallel. Map output is written to a circular in-memory buffer; when the buffer is 80% full, its contents are sorted in memory and written to local disk.

    At the end of the EXECUTION phase, the SPILLING thread is triggered for the last time. In more detail, we:

    Notice that each time the buffer is almost full, we get one spill file (SpillRecord + output file). Each spill file contains several partitions (segments).
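
The map-side behaviour described above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not Hadoop's actual code: the class `MapOutputBuffer` here, the helper `partition_for`, and the record-count capacity are all made up for illustration (the real buffer is sized in bytes and spills on a separate thread, while this version spills synchronously). The 80% figure mirrors the threshold mentioned above, which in Hadoop is controlled by `mapreduce.map.sort.spill.percent`.

```python
SPILL_PERCENT = 80  # spill once the buffer is 80% full, as described above


def partition_for(key, num_partitions):
    """Hypothetical stand-in for a partitioner: deterministic toy hash of the key."""
    return sum(ord(ch) for ch in key) % num_partitions


class MapOutputBuffer:
    """Toy model of the map-side buffer; capacity is counted in records, not bytes."""

    def __init__(self, capacity, num_partitions):
        self.capacity = capacity
        self.num_partitions = num_partitions
        self.records = []  # buffered (partition, key, value) tuples
        self.spills = []   # one entry per spill "file": {partition: [(key, value), ...]}

    def collect(self, key, value):
        """Buffer one map output record and spill if the threshold is reached."""
        part = partition_for(key, self.num_partitions)
        self.records.append((part, key, value))
        # Integer arithmetic avoids floating-point surprises in the comparison.
        if len(self.records) * 100 >= self.capacity * SPILL_PERCENT:
            self.spill()

    def spill(self):
        """Sort buffered records by (partition, key) and write them out as one spill."""
        if not self.records:
            return
        self.records.sort(key=lambda rec: (rec[0], rec[1]))
        segments = {}
        for part, key, value in self.records:
            segments.setdefault(part, []).append((key, value))
        self.spills.append(segments)
        self.records = []

    def close(self):
        """End of the map function: trigger the spill one last time for leftovers."""
        self.spill()


if __name__ == "__main__":
    buf = MapOutputBuffer(capacity=5, num_partitions=2)
    for word in ["d", "a", "c", "b", "f", "e"]:
        buf.collect(word, 1)
    buf.close()
    for i, segments in enumerate(buf.spills):
        print("spill", i, segments)
```

In this model every spill is already sorted and split into per-partition segments before it is "written", which is the point of the description above: the sort happens on the map side, at spill time, before any data leaves the map task.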