I have deployed a 3-node AWS ElasticMapReduce cluster bootstrapped with Apache Spark. From my local machine, I can access the master node by SSH:
ssh -i <key> hadoop@ec2-xx-xxx-xxx-xxx.compute-1.amazonaws.com
Once ssh'd into the master node, I can access PySpark via pyspark.
Additionally (although insecure), I have configured my master node's security group to accept TCP traffic from my local machine's IP address, specifically on port 7077.
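For reference, the ingress rule I added looks roughly like the following (the security group ID and IP are placeholders for my actual values):

aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx --protocol tcp --port 7077 --cidr xx.xxx.xxx.xxx/32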
However, I am still unable to connect my local PySpark instance to my cluster:
MASTER=spark://ec2-master-node-public-address:7077 ./bin/pyspark
The above command results in a number of exceptions and prevents PySpark from initializing a SparkContext object.
Does anyone know how to successfully create a remote connection like the one I am describing above?
Unless your local machine is the master node of your cluster, you won't be able to do that with AWS EMR. EMR runs Spark on YARN rather than in standalone mode, so there is no standalone Spark master listening on port 7077 for a remote driver to connect to. The usual approach is to run the driver on the cluster itself (SSH into the master and use pyspark or spark-submit there), or to submit the work as an EMR step.
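As a rough sketch of both options (the script name, S3 path, and cluster ID below are placeholders, not values from your setup):

# Option 1: copy the script to the master and run the driver there on YARN
scp -i <key> my_job.py hadoop@ec2-xx-xxx-xxx-xxx.compute-1.amazonaws.com:~
ssh -i <key> hadoop@ec2-xx-xxx-xxx-xxx.compute-1.amazonaws.com
spark-submit --master yarn --deploy-mode cluster my_job.py

# Option 2: stage the script in S3 and submit it as an EMR step from your local machine
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=Spark,Name=MyJob,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,s3://my-bucket/my_job.py]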