Tags: apache-spark, amazon-ec2, pyspark, emr

How can I connect PySpark (local machine) to my EMR cluster?


I have deployed a 3-node AWS Elastic MapReduce (EMR) cluster bootstrapped with Apache Spark. From my local machine, I can access the master node by SSH:

ssh -i <key> hadoop@ec2-xx-xxx-xxx-xxx.compute-1.amazonaws.com

Once SSH'd into the master node, I can access PySpark via pyspark. Additionally (although insecure), I have configured my master node's security group to accept TCP traffic from my local machine's IP address, specifically on port 7077.
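For illustration, here is a rough sketch of how such a security-group rule might be added programmatically with boto3; the group ID, region, and CIDR below are placeholders, not values from my actual setup:

import boto3

# Sketch: allow inbound TCP on port 7077 from a single IP address.
# GroupId, region_name, and CidrIp are placeholders.
ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",   # placeholder: the master node's security group
    IpProtocol="tcp",
    FromPort=7077,
    ToPort=7077,
    CidrIp="203.0.113.10/32",         # placeholder: my local machine's public IP
)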

However, I am still unable to connect my local PySpark instance to my cluster:

MASTER=spark://ec2-master-node-public-address:7077 ./bin/pyspark

The above command results in a number of exceptions and prevents PySpark from initializing a SparkContext object.
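For context, the attempt expressed in PySpark code looks roughly like the sketch below; the master URL mirrors the command above and the app name is an arbitrary placeholder:

from pyspark import SparkConf, SparkContext

# Rough equivalent of `MASTER=spark://...:7077 ./bin/pyspark` run from a local driver.
conf = (
    SparkConf()
    .setMaster("spark://ec2-master-node-public-address:7077")
    .setAppName("remote-emr-test")   # placeholder app name
)
sc = SparkContext(conf=conf)  # this is the step that raises the exceptions described above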

Does anyone know how to successfully create a remote connection like the one I am describing above?


Solution

  • Unless your local machine is the master node of your cluster, you cannot do this. Connecting a local PySpark driver to the cluster in this way is not possible with AWS EMR.