python-2.7, mapreduce, bigdata, amazon-emr, hadoop-streaming

Amazon EMR MapReduce streaming program terminated with errors


I tried to run the "word count" MapReduce program with Hadoop streaming. My mapper code works fine on my local Linux machine and on the Cloudera VM, but on Amazon AWS EMR it never succeeds. It is just a few lines of code and I have no clue what went wrong.

The code is actually the sample code from Yandex through Coursera (the Big Data course I am taking now).

Here is the code:

#!/usr/bin/python
import sys
import re

reload(sys)
sys.setdefaultencoding('utf-8')

for line in sys.stdin:
    try:
        article_id, text = unicode(line.strip()).split('\t', 1)
    except ValueError as e:
        continue
    text = re.sub(r"^\W+|\W+$", "", text, flags=re.UNICODE)
    words = re.split(r"\W*\s+\W*", text, flags=re.UNICODE)
    for word in words:
        print "%s\t%d" % (word.lower(), 1)

This was generated by EMR:

hadoop-streaming -files s3://doc-sim/Python2code/word_count_test.py \
-mapper "word_count_test.py" \
-reducer aggregate \
-input s3://doc-sim/datasets/articles-part.txt \
-output s3://doc-sim/results/output2/

I kept getting this error from AWS EMR:

Error: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1967)
    at org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorCombiner.reduce(ValueAggregatorCombiner.java:59)
    at org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorCombiner.reduce(ValueAggregatorCombiner.java:36)
    at org.apache.hadoop.mapred.Task$OldCombinerRunner.combine(Task.java:1702)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1657)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1509)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:463)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:344)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
...

I hope someone can help; otherwise I won't use Amazon anymore.


Solution

  • I think the problem is the reducer. You are not supplying a reducer script of your own, so try deleting the line -reducer aggregate. Remember that with Hadoop streaming you should specify your mappers and reducers explicitly. Another thing: you are quoting the mapper with ". Please remove the quotes; you don't need to specify it that way, just word_count_test.py.
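For context on the stack trace: the built-in aggregate reducer parses each output key as "FunctionName:key", taking a substring up to the first colon. The mapper here emits plain "word\t1" keys with no colon, so indexOf returns -1 and substring throws exactly the StringIndexOutOfBoundsException shown. If you would rather keep -reducer aggregate than delete it, the mapper's emit line needs a LongValueSum: prefix on each key (telling the aggregate package to sum the integer values per key). A minimal sketch, where format_record and emit are my own illustrative helpers rather than part of the original mapper:

```python
import sys

def format_record(word):
    # The aggregate package expects keys of the form "FunctionName:key".
    # "LongValueSum" instructs it to sum the integer values per key.
    return "LongValueSum:%s\t1" % word.lower()

def emit(word):
    # Write one record per word; works under both Python 2 and 3.
    sys.stdout.write(format_record(word) + "\n")

# In the mapper loop, replace the plain print with:
# for word in words:
#     emit(word)
```

With keys in this shape, the combiner and the aggregate reducer can parse the function name and the spill-time exception should go away.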