python · apache-spark · pyspark

How to start a standalone cluster using pyspark?


I am using pyspark on Ubuntu with Python 2.7. I installed it using

pip install pyspark --user 

and I am trying to follow the instructions to set up a Spark cluster.

I can't find the start-master.sh script. I assume this has to do with the fact that I installed pyspark rather than a regular Spark distribution.

I found here that I can connect a worker node to the master via pyspark, but how do I start the master node with pyspark?


Solution

  • Well, I did a bit of a mix-up in the OP.

    You need to get Spark on the machine that should run as the master. You can download it here.

    After extracting it, you have a spark/sbin folder that contains the start-master.sh script. You need to start it with the -h argument (see the first sketch after this list).

    Please note that you need to create a spark-env file as explained here and define the Spark local and master variables; this is important on the master machine (second sketch below).

    After that, on the worker nodes, use the start-slave.sh script to start them and point them at the master (third sketch below).

    And you are good to go: you can use a SparkContext inside Python to work with the cluster (last sketch below)!
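
Here is a minimal sketch of the master-side steps described above. The release file name and the 192.168.1.10 address are placeholders; substitute the Spark version you actually downloaded and your master machine's address.

# On the machine that will run the master: extract the Spark release
# downloaded from https://spark.apache.org/downloads.html
tar xzf spark-2.4.8-bin-hadoop2.7.tgz
cd spark-2.4.8-bin-hadoop2.7

# Start the master, binding it to this machine's address via -h
./sbin/start-master.sh -h 192.168.1.10

# The master is now reachable at spark://192.168.1.10:7077
# (its web UI is on port 8080 by default)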
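For the spark-env file, a minimal sketch is below. It assumes the "local" and "master" variables mentioned above refer to SPARK_LOCAL_IP and SPARK_MASTER_HOST; copy conf/spark-env.sh.template to conf/spark-env.sh on the master machine and adjust the address.

# conf/spark-env.sh on the master machine (address is a placeholder)
export SPARK_LOCAL_IP=192.168.1.10
export SPARK_MASTER_HOST=192.168.1.10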
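On each worker node, the start command looks roughly like this, assuming Spark is extracted there as well and the master URL is the one printed when the master started (in Spark 3.1 and later the equivalent script is named start-worker.sh).

# On each worker machine, point the worker at the master URL
./sbin/start-slave.sh spark://192.168.1.10:7077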
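Finally, a sketch of using the cluster from Python; the master URL and app name are placeholders.

# Connect a SparkContext to the standalone master and run a tiny job
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("test-app").setMaster("spark://192.168.1.10:7077")
sc = SparkContext(conf=conf)

# Distribute a small computation across the cluster as a sanity check
print(sc.parallelize(range(100)).sum())

sc.stop()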