I am trying to run the word count example from mrjob as a MapReduce job on AWS EMR.
This is the word count example from the mrjob documentation:
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()
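For reference, the job can be sanity-checked locally first (mrjob defaults to its inline runner when no -r option is given), which is a quick way to verify the logic before submitting to EMR:

python word_count.py readme.rst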
My mrjob.conf file:
runners:
  emr:
    aws_access_key_id: <my_key_id>
    aws_secret_access_key: <my_access_key>
    region: ap-southeast-1
    subnet: subnet-9a2f90fc
    ec2_key_pair: EMR
    ec2_key_pair_file: ~/.ssh/EMR.pem
    ssh_tunnel: true
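As an aside, if you let mrjob launch the cluster itself (i.e. omit --cluster-id), mrjob's EMR runner accepts an option for requesting extra applications on 4.x/5.x release labels, so Hadoop and Spark can be installed together. The option is named applications in recent mrjob versions (emr_applications in older ones), so treat this as a sketch and check your version's docs:

runners:
  emr:
    # ... credentials and networking as above ...
    release_label: emr-5.8.0   # any 4.x/5.x release label (assumption)
    applications:
      - Hadoop
      - Spark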
Run command:
python word_count.py -r emr --cluster-id=j-CLUSTER_ID readme.rst --conf-path mrjob.conf
My problem is that I can run this example when the cluster's application is set to Core Hadoop, but I cannot run it when the cluster is created with the Spark application option.
This is the error when running against a Spark EMR cluster:
Waiting for Step 1 of 1 (s-xxx) to complete...
PENDING (cluster is RUNNING: Running step)
FAILED
Cluster j-CLUSTER_ID is WAITING: Cluster ready after last step failed.
I want to run this on a cluster with Spark because my application involves both Spark code and MapReduce code.
How can I fix this problem?
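For what it's worth, mrjob can also submit Spark steps directly: a job class may define a spark() method, which mrjob runs via spark-submit on the cluster (supported since mrjob 0.5.7). A minimal sketch of a Spark word count in that style (the class name and word-splitting logic here are illustrative):

from mrjob.job import MRJob

class MRSparkWordcount(MRJob):

    def spark(self, input_path, output_path):
        # pyspark is only available on the cluster, so import it inside spark()
        from pyspark import SparkContext

        sc = SparkContext(appName='mrjob Spark wordcount')

        # classic RDD word count: split lines into words, count each word
        (sc.textFile(input_path)
           .flatMap(lambda line: line.split())
           .map(lambda word: (word, 1))
           .reduceByKey(lambda a, b: a + b)
           .saveAsTextFile(output_path))

        sc.stop()

if __name__ == '__main__':
    MRSparkWordcount.run()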
I found that I can create a cluster with both Hadoop and Spark installed. On the Create Cluster - Quick Options page, click Go to advanced options, select Spark in the software configuration, and continue setting up the cluster as usual.
After the cluster is created, I can run both MapReduce and Spark applications on this cluster.
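The same kind of cluster can also be created from the command line with the AWS CLI; a sketch is below (the release label, instance type, and instance count are assumptions, and the key pair and subnet are taken from the config above, so adjust them to your setup):

aws emr create-cluster \
    --name "hadoop-plus-spark" \
    --release-label emr-5.8.0 \
    --applications Name=Hadoop Name=Spark \
    --ec2-attributes KeyName=EMR,SubnetId=subnet-9a2f90fc \
    --instance-type m4.large \
    --instance-count 3 \
    --use-default-roles \
    --region ap-southeast-1

Once the cluster is up, pass its j-XXX id to mrjob via --cluster-id as in the run command above.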