python, hadoop, mapreduce, hadoop-streaming

How to distribute a MapReduce task in Hadoop Streaming


For example, I have multiple log files, each with many lines, and a mapper.py script that parses a file. In this case, I want my mapper to process each file independently.


Solution

  • Hadoop Streaming is already "distributed", but each job is tied to a single input and output stream. To process files independently, you would need to write a script that loops over the files and runs an individual streaming job per file (see the first sketch below).

    If you want to batch process many files, you should upload all of them to a single HDFS folder, and then you can use mrjob (assuming you actually want MapReduce), or you could switch to PySpark to process them all in parallel, since I see no need to do that sequentially (see the second sketch below).
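
    A minimal sketch of that per-file loop, assuming the streaming jar lives at /usr/lib/hadoop/hadoop-streaming.jar and the logs sit in an HDFS folder named /logs (both hypothetical paths you would adjust for your cluster):

    ```python
    import subprocess

    # Hypothetical paths; adjust for your cluster and data layout.
    STREAMING_JAR = "/usr/lib/hadoop/hadoop-streaming.jar"
    INPUT_DIR = "/logs"        # HDFS directory holding the log files
    OUTPUT_PREFIX = "/parsed"  # each job writes to its own output dir

    # List the files in the HDFS input directory (-C prints paths only).
    listing = subprocess.run(
        ["hdfs", "dfs", "-ls", "-C", INPUT_DIR],
        capture_output=True, text=True, check=True,
    )
    files = listing.stdout.split()

    # Launch one streaming job per file so each log is parsed independently.
    for i, path in enumerate(files):
        subprocess.run(
            [
                "hadoop", "jar", STREAMING_JAR,
                "-input", path,
                "-output", f"{OUTPUT_PREFIX}/{i}",
                "-mapper", "python3 mapper.py",
                "-file", "mapper.py",    # ship the script to the cluster
                "-numReduceTasks", "0",  # map-only job; no reducer needed
            ],
            check=True,
        )
    ```

    Each job writes to its own numbered output directory because Hadoop refuses to write into one that already exists.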
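
    And a minimal PySpark sketch for the parallel alternative, assuming the same hypothetical /logs folder and treating parse_line as a stand-in for whatever parsing mapper.py does:

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parse-logs").getOrCreate()

    def parse_line(line):
        # Hypothetical stand-in for the parsing logic in mapper.py.
        return line.strip().split("\t")

    # textFile reads every file under the folder; Spark splits the work
    # across the cluster, so all files are processed in parallel.
    parsed = spark.sparkContext.textFile("hdfs:///logs/*").map(parse_line)
    parsed.saveAsTextFile("hdfs:///parsed")
    ```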