When we broadcast a dataframe (let's say broadcast joins), it broadcasts the same copy of dataframe to all the executors, i.e. each executor (not node, but executor) will hold one copy of the data. So if a node has 3 executors, it unnecessarily holds 3 copies of the same data.
My question is, why can't it use off-heap space? Let's say I give sufficient off-heap memory so why can't it store this data there, just one copy per node and all executors in that node can read from this off heap space.
That's a good question, let me first clarify something:
Why can't it use off-heap space?
Alternatives:
Resources used:
1 - https://blog.devgenius.io/spark-on-heap-and-off-heap-memory-27b625af778b