I am trying to run a Hadoop Streaming MapReduce job written in Python, but it keeps failing with the same errors:
File: file:/C:/py-hadoop/map.py is not readable
or
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
I am using Hadoop 3.1.1 and Python 3.8 on Windows 10.
Here is my MapReduce command line:
hadoop jar C:/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.1.1.jar -file C:/py-hadoop/map.py,C:/py-hadoop/reduce.py -mapper "python map.py" -reducer "python reduce.py" -input /user/input -output /user/python-output
map.py
import sys

# Emit "word<TAB>1" for every token read from stdin
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print("%s\t%s" % (word, 1))
reduce.py
from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None
clean = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    word = filter(lambda x: x in clean, word).lower()
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print("%s\t%s" % (current_word, current_count))
        current_count = count
        current_word = word

if current_word == word:
    print("%s\t%s" % (current_word, current_count))
I have also already tried different command lines, like
hadoop jar C:/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.1.1.jar -file C:/py-hadoop/map.py -mapper "python map.py" -file C:/py-hadoop/reduce.py -reducer "python reduce.py" -input /user/input -output /user/python-output
and
hadoop jar C:/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.1.1.jar -file py-hadoop/map.py -mapper "python map.py" -file py-hadoop/reduce.py -reducer "python reduce.py" -input /user/input -output /user/python-output
but they still give the exact same errors.
I'm sorry if my English is bad; I'm not a native speaker.
I already fixed it. The problem was caused by reduce.py: it calls .lower() on the result of filter(), which is a lazy filter object in Python 3, so the reducer crashes with an AttributeError and the subprocess exits with code 1. Here is my new reduce.py:
import sys
import collections

# Sum the counts for each word coming from the mappers
counter = collections.Counter()
for line in sys.stdin:
    word, count = line.strip().split("\t", 1)
    counter[word] += int(count)

# Print the totals, most frequent words first
for x in counter.most_common(9999):
    print(x[0], "\t", x[1])
And here is the command line that I used to run it:
hadoop jar C:/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.1.1.jar -file C:/py-hadoop/map.py -file C:/py-hadoop/reduce.py -mapper "python map.py" -reducer "python reduce.py" -input /user/input -output /user/python-output
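As a side note, -file is deprecated in the Hadoop 3.x streaming tool in favor of the generic -files option, so the same job should also be submittable as (untested variant of the command above):
hadoop jar C:/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.1.1.jar -files C:/py-hadoop/map.py,C:/py-hadoop/reduce.py -mapper "python map.py" -reducer "python reduce.py" -input /user/input -output /user/python-output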