hadoop, elastic-map-reduce, mrjob

Write some data (lines) from my mappers to separate directories depending on some logic in my mapper code


I am using mrjob for my EMR needs.

How do I write some data (lines) from my mappers to "separate directories" depending on some logic in my mapper code that I can:

  1. tar gzip and

  2. upload to separate S3 buckets (depending on the directory name) after the job finishes/terminates abruptly?

I guess the '--output-dir' option only allows you to upload the final job output to that directory, but I would also like to write to other directories from my mappers from time to time.


Solution

  • No, you can't in the traditional sense.

    Reason: mrjob internally uses Hadoop Streaming to run map/reduce jobs on a Hadoop cluster, and I am assuming the same holds for Amazon Elastic MapReduce.

    The --output-dir is actually an input to Hadoop Streaming that specifies where the output of the reducers will be collected. You cannot use this mechanism to segregate data into different folders.
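    To illustrate, a typical mrjob EMR invocation looks something like this (the job file and bucket names are placeholders); note that all reducer output lands under the single --output-dir prefix:

    ```shell
    # Run the job on EMR; every reducer's output is collected under this one
    # S3 prefix. --output-dir cannot fan out into multiple directories.
    python my_job.py -r emr --output-dir s3://my-output-bucket/job-output/ input.txt
    ```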

    [Edit: In response to comment]

    My understanding is that boto is just a library for connecting to Amazon services and accessing EC2, S3, etc.

    In a non-traditional sense you can still write to different directories, I guess.

    I have not tested this idea and don't recommend the approach, but theoretically you could do it: instead of only writing reducer output to stdout, you could open S3 objects and write to them directly from within the reducers, much like opening a local file and writing to it. You would have to ensure that each reducer writes to a different key, since the framework spawns multiple reducers.
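    An untested sketch of that idea, using boto 2.x (the bucket name and the routing rule, here "first tab-separated field is the category", are assumptions you would replace with your own logic):

    ```python
    import uuid
    from collections import defaultdict

    def partition_lines(lines):
        """Group lines by a category decided in your mapper/reducer logic.

        Here the category is simply the first tab-separated field;
        substitute whatever routing logic your job actually needs.
        """
        groups = defaultdict(list)
        for line in lines:
            category, _, rest = line.partition("\t")
            groups[category].append(rest)
        return groups

    def upload_groups(groups, bucket_name="my-side-output-bucket"):
        """Write each group to its own S3 prefix ("directory").

        A uuid in the key name keeps concurrently running reducers
        from clobbering each other's files.
        """
        import boto  # assumed installed on the task nodes

        conn = boto.connect_s3()
        bucket = conn.get_bucket(bucket_name)
        for category, rows in groups.items():
            key = bucket.new_key("%s/part-%s" % (category, uuid.uuid4().hex))
            key.set_contents_from_string("\n".join(rows))
    ```

    You would call partition_lines over the lines seen by a reducer and upload_groups in its final step; tar/gzip of each prefix would then happen after the job, outside of mrjob.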

    This is what I learned while using MrJob with Hadoop cluster: http://pyfunc.blogspot.com/2012/05/hadoop-map-reduce-with-mrjob.html