apache-spark

What is the difference between Spark Standalone, YARN and local mode?


Spark Standalone:

From what I understand, in this mode you run your master and worker nodes on your local machine.

Does that mean I have an instance of YARN running on my local machine? When I installed Spark it came with Hadoop, and YARN usually ships with Hadoop as well, correct? So in this mode I can essentially simulate a smaller version of a full-blown cluster?

Spark Local Mode:

This is the part I am also confused about. To run in this mode I do val conf = new SparkConf().setMaster("local[2]").

In this mode, it doesn't use any type of resource manager (like YARN), correct? It simply runs the Spark job in the number of threads you provide to "local[2]"?
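
For reference, here is roughly the minimal setup I am running (the app name is just a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    // Run the whole job inside this one JVM, using 2 threads as "cores".
    val conf = new SparkConf()
      .setAppName("local-mode-test") // placeholder name
      .setMaster("local[2]")
    val sc = new SparkContext(conf)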


Solution

  • You are confusing Hadoop YARN with Spark.

    YARN is a software rewrite that decouples MapReduce's resource management and scheduling capabilities from the data processing component, enabling Hadoop to support more varied processing approaches and a broader array of applications.

    With the introduction of YARN, Hadoop opened up to running other applications on the platform.

    In short, YARN is a "pluggable, data-parallel framework".

    Apache Spark

    Apache Spark is a batch/interactive/streaming framework. Spark has a "pluggable persistent store" and can run on top of any persistence layer.
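
    For example (the paths and host name below are placeholders), the same RDD API reads from different persistence layers; only the URI scheme changes:

        val fromHdfs  = sc.textFile("hdfs://namenode-host:8020/data/input.txt") // HDFS
        val fromLocal = sc.textFile("file:///tmp/input.txt")                    // local file system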

    For Spark to run, it needs resources. In Standalone mode you start the workers and the Spark master yourself, and the persistence layer can be anything: HDFS, the local file system, Cassandra, etc. In YARN mode you ask the YARN/Hadoop cluster to manage the resource allocation and bookkeeping.
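
    As a sketch of the difference (the master host below is a placeholder), the master URL is what selects the resource manager; on real clusters it is more commonly passed via spark-submit --master than hard-coded:

        import org.apache.spark.SparkConf

        // Standalone: the driver asks a Spark master process for resources.
        val standaloneConf = new SparkConf()
          .setAppName("standalone-example")
          .setMaster("spark://spark-master.example.com:7077")

        // YARN: the driver asks the YARN ResourceManager instead; this assumes
        // HADOOP_CONF_DIR / YARN_CONF_DIR point at the cluster configuration.
        val yarnConf = new SparkConf()
          .setAppName("yarn-example")
          .setMaster("yarn")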

    When you set the master to local[2], you ask Spark to use 2 cores and to run the driver and the workers in the same JVM. In local mode, all Spark-job-related tasks run in the same JVM.
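
    A small way to see this (thread names may vary by Spark version): in local mode each task runs on a thread inside the driver's JVM, so printing the thread name per task shows everything living in one process:

        import org.apache.spark.{SparkConf, SparkContext}

        val sc = new SparkContext(new SparkConf().setAppName("threads-demo").setMaster("local[2]"))

        // Each task reports the thread it ran on; in local mode these are
        // executor task threads in the same JVM as the driver.
        sc.parallelize(1 to 4, numSlices = 4)
          .map(i => s"task $i ran on ${Thread.currentThread().getName}")
          .collect()
          .foreach(println)

        sc.stop()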

    So the only difference between Standalone and local mode is that in Standalone mode you define "containers" for the workers and the Spark master to run on your machine (so you can have 2 workers, and your tasks can be distributed across the JVMs of those two workers), whereas in local mode you are just running everything in one JVM on your local machine.