
Sharing data across executors in Apache Spark


My Spark project (written in Java) needs to access different tables (the results of SELECT queries) from code running on the executors.

One solution to this problem is:

  1. Create a temp view.
  2. Select the required columns.
  3. Convert the DataFrame to a Map using forEach.
  4. Pass that Map to the executors as a broadcast variable.
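The steps above can be sketched in Java roughly as follows. This is a minimal sketch, not the actual project code: the view name `lookup`, the columns `id` (integer) and `name` (string), and the class and method names are all hypothetical. It collects the rows with `collectAsList()` rather than forEach, because `Dataset.foreach` runs on the executors and would not fill a Map that lives on the driver.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BroadcastLookupSketch {

    // Collects a two-column query result on the driver and broadcasts it as a Map.
    // Assumption: "lookup", "id" and "name" are hypothetical names for illustration.
    public static Broadcast<Map<Integer, String>> broadcastLookup(
            SparkSession spark, Dataset<Row> table) {
        // Step 1: register the table under a session-scoped view name.
        table.createOrReplaceTempView("lookup");

        // Step 2: select only the required columns.
        Dataset<Row> selected = spark.sql("SELECT id, name FROM lookup");

        // Step 3: pull the rows to the driver and build the Map there.
        // The whole result must fit in driver memory.
        Map<Integer, String> map = new HashMap<>();
        for (Row row : selected.collectAsList()) {
            map.put(row.getInt(0), row.getString(1));
        }

        // Step 4: ship one read-only copy of the Map to each executor.
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
        return jsc.broadcast(map);
    }
}
```

Code running inside a task would then read the Map through `value()` on the returned broadcast variable.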

However, I have found that:

  1. There are many complex queries whose results can't be stored directly in a Map.
  2. The tables are very large, so building a large Map and passing it to the executors as a broadcast variable doesn't sound efficient.

Instead, can we load the tables in memory using load so that they can be shared across executors?

Is the method void org.apache.spark.sql.Dataset.createOrReplaceTempView(String viewName)

or void org.apache.spark.sql.Dataset.createGlobalTempView(String viewName) throws AnalysisException

useful for this purpose?

Spark version: 2.3.0


Solution

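  • You can broadcast a DataFrame with org.apache.spark.sql.functions.broadcast, which marks it as small enough to be used in a broadcast join. See the documentation.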
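One way to read this suggestion is as a broadcast join, sketched below. This assumes the answer means the `org.apache.spark.sql.functions.broadcast` hint, which marks a DataFrame as small enough for use in broadcast joins. The join column `key` and the class and method names are hypothetical.

```java
import static org.apache.spark.sql.functions.broadcast;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class BroadcastJoinSketch {

    // Joins a large table with a smaller one on a shared column.
    // The broadcast() hint asks Spark to send the small side to every executor
    // instead of shuffling the large side.
    // Assumption: both inputs have a column named "key" (hypothetical).
    public static Dataset<Row> joinWithBroadcast(Dataset<Row> large, Dataset<Row> small) {
        return large.join(broadcast(small), "key");
    }
}
```

This keeps the data in DataFrame form, so complex query results do not have to be converted to a Map first. The broadcast side still has to fit in memory on each executor, so it likely suits the smaller lookup tables rather than the very large ones.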