python-3.x hadoop-streaming

Does Python multiprocessing work with Hadoop Streaming?


In Hadoop Streaming, where the mapper and reducer are written in Python, does it help to make the mapper use the multiprocessing module? Or does the scheduler prevent mapper scripts from running multiple threads or processes on the compute nodes?


Solution

  • In classic MapReduce, nothing stops you from running multiple threads inside a mapper or a reducer. The same is true for Hadoop Streaming: you can have multiple threads per mapper or reducer. This is useful when a job is CPU-heavy and you want to speed it up.

    If you're doing Hadoop Streaming with Python, you can use the multiprocessing module, which starts worker processes rather than threads, to speed up your mapper phase.

    Note that, depending on how your Hadoop cluster is configured (how many mapper/reducer JVMs run per node), you may have to adjust the maximum number of processes each task uses, since every concurrent task on a node starts its own set of workers.
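
As a rough illustration of the points above, here is a minimal sketch of a streaming mapper that hands each input line to a multiprocessing pool. The names `process_line`, `run_mapper`, `main` and the `MAPPER_PROCESSES` environment variable are made up for this example, and the word-count logic only stands in for whatever CPU-heavy work your mapper really does.

```python
#!/usr/bin/env python3
"""Sketch of a Hadoop Streaming mapper that uses worker processes."""
import os
import sys
from multiprocessing import Pool


def process_line(line):
    """Hypothetical CPU-heavy step: return (word, 1) pairs for one line."""
    return [(word.lower(), 1) for word in line.split()]


def _emit(results, out):
    """Write key<TAB>value lines, the default streaming record format."""
    for pairs in results:
        for key, value in pairs:
            out.write(f"{key}\t{value}\n")


def run_mapper(lines, out, processes=1, chunksize=100):
    """Map every input line, serially or in a pool of worker processes."""
    if processes <= 1:
        # Serial fallback: no pool is created at all.
        _emit(map(process_line, lines), out)
    else:
        with Pool(processes=processes) as pool:
            # imap yields results in input order as they become ready.
            _emit(pool.imap(process_line, lines, chunksize=chunksize), out)


def main():
    # MAPPER_PROCESSES is an assumed, made-up variable name; choose the
    # value with the number of concurrent tasks per node in mind.
    processes = int(os.environ.get("MAPPER_PROCESSES", "2"))
    run_mapper(sys.stdin, sys.stdout, processes=processes)
```

To use this as the actual mapper script, end the file with `if __name__ == "__main__": main()`. Hadoop Streaming's `-cmdenv NAME=VALUE` option is one way to pass a value such as `MAPPER_PROCESSES` to the script, so the process count can be tuned per cluster without editing the code.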