I went through the documentation, and it says mrjob is meant for AWS and GCP. But they must also be using it internally somehow, right? So there should be a way to make it run on our own locally created Hadoop cluster in our own VirtualBox.
Some code to show how mrjob is used :-
class MovieSimilar(MRJob):
    def mapper_parse_input(self, key, line):
        (userID, movieID, rating, timestamp) = line.split('\t')
        yield userID, (movieID, float(rating))
    ..........
    ..........
if __name__ == '__main__':
    MovieSimilar.run()
With the hadoop-streaming jar and plain Python scripts I am able to run Python code. But mrjob isn't accepting the dataset location from the command line and fails with "more than 2 values required to unpack". I think that error occurs because it is unable to read the dataset given with the -input flag.
The shell command I am using :-
bin/hadoop jar /usr/local/Cellar/hadoop/3.1.0/libexec/share/hadoop/tools/lib/hadoop-streaming.jar \
-file /<path_to_mapper>/MovieSimilar.py \
-mapper /<path_to_mapper>/MovieSimilar.py \
-reducer /<path_to_reducer>/MovieSimilar.py \
-input daily/<dataset-file>.csv \
-output daily/output
Note:- daily is my HDFS directory where the datasets and the results of programs get stored
Error message I am receiving :- more than 2 values required to unpack
says it is meant for aws, gcp
Those are examples. It is not meant only for those. Notice the -r local and -r hadoop flags for running a job:
https://mrjob.readthedocs.io/en/latest/guides/runners.html#running-on-your-own-hadoop-cluster
there should be a way to make it run in our own locally created hadoop cluster in our own virtual box
Set up your HADOOP_HOME, and the XML files in HADOOP_CONF_DIR, to point at the cluster you want to run the code against. Then, with the -r hadoop runner flag, mrjob will find and run your code using the hadoop binary and the hadoop-streaming jar file.
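Before submitting anything, it can help to confirm those two variables are actually visible to the Python process. The following is only a rough sketch: check_hadoop_env is a hypothetical helper (not part of mrjob), and looking for core-site.xml assumes a conventional Hadoop configuration layout.

```python
import os


def check_hadoop_env(env=None):
    """Return a list of problems; an empty list means the basics look fine.

    Hypothetical helper for illustration only, not an mrjob API.
    """
    env = os.environ if env is None else env
    problems = []
    for name in ("HADOOP_HOME", "HADOOP_CONF_DIR"):
        path = env.get(name)
        if not path:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(path):
            problems.append(f"{name} points at a missing directory: {path}")

    conf_dir = env.get("HADOOP_CONF_DIR")
    if conf_dir and os.path.isdir(conf_dir):
        # Assumption: core-site.xml is where the cluster's filesystem URI is configured.
        if not os.path.isfile(os.path.join(conf_dir, "core-site.xml")):
            problems.append(f"no core-site.xml found in {conf_dir}")
    return problems


if __name__ == "__main__":
    for problem in check_hadoop_env():
        print(problem)
```

If this prints nothing, both variables point at real directories; it does not prove the cluster itself is reachable.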
more than 2 values required to unpack. And that error is because it is unable to take dataset given -input flag
Can't see your input, but this line would cause that error if any line had fewer than three tabs (and you don't need the parentheses to the left of the equals sign):
(userID, movieID, rating, timestamp) = line.split('\t')
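To see the failure in isolation, here is a small sketch that needs neither mrjob nor Hadoop. The sample lines are made up for illustration, and parse_rating_line is a hypothetical helper, not something from your job. It shows the bare unpack failing on a comma-separated line, and a defensive variant that skips malformed lines instead of crashing.

```python
def parse_rating_line(line, sep="\t"):
    """Return (user_id, (movie_id, rating)), or None if the line is malformed."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 4:
        # Wrong separator, header line, blank line, etc.
        return None
    user_id, movie_id, rating, _timestamp = fields
    return user_id, (movie_id, float(rating))


# Made-up sample lines: one tab-separated, one comma-separated.
tab_line = "196\t242\t3\t881250949"
csv_line = "196,242,3,881250949"

try:
    # Same pattern as the mapper: four names, but split('\t') finds no tabs here.
    user_id, movie_id, rating, timestamp = csv_line.split("\t")
except ValueError as err:
    print("unpack failed:", err)

print(parse_rating_line(tab_line))  # ('196', ('242', 3.0))
print(parse_rating_line(csv_line))  # None
```

In a mapper you would yield only when the result is not None, or pass sep="," if the file really is comma-separated.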
I suggest testing your code with the local or inline runner first (-r local or -r inline).
The shell command I am using :-
bin/hadoop jar /usr/local/Cellar/hadoop/3.1.0/libexec/share/hadoop/tools/lib/hadoop-streaming.jar
mrjob will build and submit that for you. You only need to run python MovieSimilar.py with your input files.
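As a rough sketch of what that invocation looks like, the snippet below just assembles the argument list. build_mrjob_command is a hypothetical helper, and the file names are placeholders. It assumes mrjob's usual command-line shape: -r to pick the runner, an optional --output-dir, and input paths as positional arguments.

```python
import sys


def build_mrjob_command(script, input_paths, runner="hadoop", output_dir=None):
    """Assemble the argv for running an mrjob script (hypothetical helper)."""
    command = [sys.executable, script, "-r", runner]
    if output_dir is not None:
        command += ["--output-dir", output_dir]
    # Input files go last, as plain positional arguments (no -input flag).
    command += list(input_paths)
    return command


if __name__ == "__main__":
    # Try the inline runner on a small local sample first (placeholder file name).
    print(" ".join(build_mrjob_command("MovieSimilar.py", ["sample.tsv"], runner="inline")))
    # Then point the same script at the cluster (placeholder paths).
    print(" ".join(build_mrjob_command(
        "MovieSimilar.py", ["daily/<dataset-file>.csv"], output_dir="daily/output")))
    # To actually launch it: subprocess.run(command, check=True)
```

The point is that the script path appears once and there are no -mapper, -reducer or -input flags; the runner works those out for each step.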