python, apache-spark, pyspark, python-multiprocessing

PySpark executing queries from different processes


Is there any way to have two separate processes executing queries on Spark? Something like:

from multiprocessing import Process

from pyspark.sql import SparkSession


def process_1():
    spark = SparkSession.builder.getOrCreate()
    data1 = spark.sql("SELECT * FROM table1").toPandas()
    do_processing(data1)


def process_2():
    spark = SparkSession.builder.getOrCreate()
    data2 = spark.sql("SELECT * FROM table2").toPandas()
    do_processing(data2)


p1 = Process(target=process_1)
p1.start()
p2 = Process(target=process_2)
p2.start()

p1.join()
p2.join()

The problem is: how do I create separate SparkContexts for the processes, or how do I share one SparkContext between them?


Solution

  • PySpark holds its SparkContext as a singleton object.

    Only one SparkContext should be active per JVM. You must stop() the active SparkContext before creating a new one.

    A SparkContext instance cannot be shared across multiple processes out of the box, and PySpark does not guarantee multi-processing execution. Use threads instead for concurrent processing (see the thread-based sketch below).

    As for your "out of memory" problem (in your code): it is likely caused by DF.toPandas(), which collects the whole result into driver memory and significantly increases memory usage.
    Consider writing the loaded data to Parquet files instead and optimizing further computations with pyarrow, as in the second sketch below.
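
    A minimal sketch of the thread-based approach: one shared SparkSession, with each query submitted from its own thread. The table names table1/table2 and do_processing() are just the placeholders from the question.

    from concurrent.futures import ThreadPoolExecutor

    from pyspark.sql import SparkSession

    # A single SparkSession (and therefore a single SparkContext) is shared by all threads.
    spark = SparkSession.builder.getOrCreate()

    def run_query(query):
        # Each thread submits its own Spark job; the scheduler runs them concurrently.
        return spark.sql(query).toPandas()

    with ThreadPoolExecutor(max_workers=2) as pool:
        future1 = pool.submit(run_query, "SELECT * FROM table1")
        future2 = pool.submit(run_query, "SELECT * FROM table2")
        data1 = future1.result()
        data2 = future2.result()

    # data1 / data2 can now be passed to do_processing() as in the question.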
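
    And a minimal sketch of the Parquet/pyarrow route, assuming a hypothetical output path /tmp/table1_parquet: the query result is written to Parquet by the executors instead of being collected with toPandas(), then read back with pyarrow.

    import pyarrow.parquet as pq
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Write the query result straight to Parquet instead of pulling it into
    # driver memory with toPandas().
    spark.sql("SELECT * FROM table1").write.mode("overwrite").parquet("/tmp/table1_parquet")

    # Read the Parquet files back with pyarrow; a columns=[...] subset can be
    # passed to load only what is needed and keep memory usage down.
    table = pq.read_table("/tmp/table1_parquet")
    data1 = table.to_pandas()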