python, apache-spark, amazon-emr, mrjob

Cannot run MapReduce job on AWS EMR Spark application


I am trying to run the word count example from the mrjob documentation as a MapReduce job on AWS EMR.

This is the word count code example from mrjob:

from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        # Emit per-line counts of characters, words, and lines
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        # Sum the counts emitted for each key
        yield key, sum(values)


if __name__ == '__main__':
    MRWordFrequencyCount.run()
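
Before submitting to EMR, it may help to sanity-check the job locally; mrjob falls back to its inline runner when no -r option is given, so (assuming readme.rst is in the current directory) something like this should work:

python word_count.py readme.rst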

My mrjob.conf file:

runners:
  emr:
    aws_access_key_id: <my_key_id>
    aws_secret_access_key: <my_access_key>
    region: ap-southeast-1
    subnet: subnet-9a2f90fc
    ec2_key_pair: EMR
    ec2_key_pair_file: ~/.ssh/EMR.pem
    ssh_tunnel: true

Run command:

python word_count.py -r emr --cluster-id=j-CLUSTER_ID readme.rst --conf-path mrjob.conf

My problem is that this example runs fine when I create the cluster with the Core Hadoop application, but it fails when I create the cluster with the Spark application option.


This is the error when running on the EMR cluster created with the Spark application:

Waiting for Step 1 of 1 (s-xxx) to complete...
  PENDING (cluster is RUNNING: Running step)
  FAILED
Cluster j-CLUSTER_ID is WAITING: Cluster ready after last step failed.

I want to run this with Spark because my application involves some Spark code and some MapReduce code.

How can I fix this problem?


Solution

  • I found that I can create a cluster with both Hadoop and Spark installed. On the Create Cluster - Quick Options page, click Go to advanced options.


    In the software configuration step, select Spark in addition to the Hadoop components that are checked by default, and continue setting up your cluster as usual.

    After the cluster is created, I can run both MapReduce and Spark applications on this cluster.
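
    If you prefer to script this, roughly the same cluster can be created with the AWS CLI; this is only a sketch (the release label, instance type, and instance count are placeholder values, and --use-default-roles assumes the default EMR roles already exist in your account):

    aws emr create-cluster \
      --name "hadoop-spark" \
      --release-label emr-5.20.0 \
      --applications Name=Hadoop Name=Spark \
      --ec2-attributes KeyName=EMR,SubnetId=subnet-9a2f90fc \
      --instance-type m4.large \
      --instance-count 3 \
      --use-default-roles \
      --region ap-southeast-1

    The cluster id it returns can then be passed to the original run command through --cluster-id.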