Is there any way to have two separate processes executing queries on Spark? Something like:
def process_1():
spark_context = SparkSession.builder.getOrCreate()
data1 = spark_context.sql("SELECT * FROM table 1").toPandas()
do_processing(data1)
def process_2():
spark_context = SparkSession.builder.getOrCreate()
data1 = spark_context.sql("SELECT * FROM table 2").toPandas()
do_processing(data1)
p1 = Process(target=process_1)
p1.start()
p2 = Process(target=process_2)
p2.start()
p1.join()
p2.join()
The problem is how to create separate SparkContexts for processes or how to pass one context between processes?
PySpark holds its Context as a singleton object.
Only one
SparkContext
should be active per JVM. You muststop()
the activeSparkContext
before creating a new one.
SparkContext
instance is not supported to share across multiple processes out of the box, and PySpark does not guarantee multi-processing execution. Use threads instead for concurrent processing purpose.
As for your "out of memory" problem (in your code): that could be caused by DF.toPandas()
which significantly increases memory usage.
Consider writing the loaded data into parquet files and optimize computations with pyarrow functionality.