apache-spark broadcasting

Where are broadcast variables stored in Spark?


As per the official docs, "Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks."

Let's say I pass --num-executors 10 in my spark-submit command. My cluster has 2 nodes; assume for now that 5 executors are launched on node 1 and the other 5 on node 2.

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

Per the docs, will this broadcastVar be held in the storage memory of each executor, meaning there are 10 copies of it?

or

Will this broadcastVar be stored on disk on each node, so that each of the 2 nodes gets one copy and all the executors running on that node fetch it from there?


Solution

  • Looking at how broadcast is implemented in the TorrentBroadcast class:

    The driver divides the serialized object into small chunks and
    stores those chunks in the BlockManager of the driver.
    
    On each executor, the executor first attempts to fetch the object from its BlockManager. If
    it does not exist, it then uses remote fetches to fetch the small chunks from the driver and/or
    other executors if available. Once it gets the chunks, it puts the chunks in its own
    BlockManager, ready for other executors to fetch from.

    So broadcast variables are stored in each executor's BlockManager, not once per node. Each executor that uses the variable gets its own copy: up to 10 copies in your example, not 2.

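    Accumulator variables are handled differently: they are not stored in the BlockManager. Each task updates its own local copy, and the driver merges those updates when tasks complete.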
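The fetch path quoted above for broadcast variables can be sketched as a toy model in plain Scala. This is not Spark's code: `ToyBlockStore`, `publish`, and `read` are hypothetical names invented for illustration, and a simple in-memory map stands in for each JVM's BlockManager. It only shows the idea that every executor ends up holding its own chunks.

```scala
import scala.collection.mutable

object ToyTorrentBroadcast {
  // Hypothetical stand-in for a per-JVM BlockManager: chunk index -> bytes.
  final class ToyBlockStore(val name: String) {
    private val chunks = mutable.Map.empty[Int, Array[Byte]]
    def get(i: Int): Option[Array[Byte]] = chunks.get(i)
    def put(i: Int, bytes: Array[Byte]): Unit = chunks(i) = bytes
    def size: Int = chunks.size
  }

  // Driver side: split the serialized payload into chunks and store them
  // in the driver's own block store. Returns the number of chunks.
  def publish(driver: ToyBlockStore, payload: Array[Byte], chunkSize: Int): Int = {
    val pieces = payload.grouped(chunkSize).toVector
    pieces.zipWithIndex.foreach { case (bytes, i) => driver.put(i, bytes) }
    pieces.size
  }

  // Executor side: use the local chunk if present; otherwise fetch it from
  // the first remote store that has it (driver or another executor) and
  // keep a local copy so later readers can fetch it from here.
  def read(local: ToyBlockStore, remotes: Seq[ToyBlockStore], numChunks: Int): Array[Byte] = {
    val pieces = (0 until numChunks).map { i =>
      local.get(i).getOrElse {
        val bytes = remotes.flatMap(_.get(i)).headOption
          .getOrElse(sys.error(s"chunk $i not found in any store"))
        local.put(i, bytes)
        bytes
      }
    }
    Array.concat(pieces: _*)
  }

  def main(args: Array[String]): Unit = {
    val driver = new ToyBlockStore("driver")
    // 10 executors, as in the question; which node they run on does not matter here.
    val executors = Vector.tabulate(10)(i => new ToyBlockStore(s"executor-$i"))
    val n = publish(driver, "1,2,3".getBytes("UTF-8"), chunkSize = 2)
    executors.foreach { e =>
      val remotes = driver +: executors.filterNot(_ eq e)
      read(e, remotes, n)
    }
    // Counts the executors that now hold a full local set of chunks.
    println(executors.count(_.size == n))
  }
}
```

In this toy run, each of the ten executor stores holds a complete set of chunks after its first read, which mirrors the "one copy per executor" conclusion above. Real Spark adds serialization, storage levels, and lazy fetching on first access to the value, none of which is modelled here.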