apache-spark, pyspark, apache-spark-sql

org.apache.spark.SparkException: Python worker failed to connect back


I'm trying to create a DataFrame using the createDataFrame method, and I get an error for the following code:

from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder \
        .appName("MyApp") \
        .getOrCreate()
person = spark.createDataFrame([
    (0, "AA", 0),
    (1, "BB", 1),
    (2, "CC", 1)
], schema=["id", "name", "graduate"])
person.take(6)

This is the error I got:

Py4JJavaError: An error occurred while calling o43.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (QUASAR executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:203)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)
    at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:174)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:67)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)

But when I import the data from a CSV or any other file, everything works fine. Here is a working example of that:

flightData2015 = spark.read.option("inferSchema", "true").option("header","true").csv("flight-data\\csv\\2015-summary.csv")
flightData2015.take(5)

I'm not sure why I get the error only when I try to print the DataFrame created using the createDataFrame method.


Solution

  • It seems I had missed setting the following environment variables:

    PYSPARK_PYTHON=path\python
    PYSPARK_DRIVER_PYTHON=path\python
    

    After setting the above variables in the environment, everything works fine.

    But I still wonder why only the DataFrame created using spark.createDataFrame() failed, while the one created using spark.read() worked when the above environment variables were missing. Please let me know.
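
If you'd rather not set the variables system-wide, the same fix can be applied from the script itself, as long as it happens before the SparkSession is built. A minimal sketch, assuming the workers should use the same interpreter that runs the driver script (`sys.executable`); adjust the path if your workers need a different Python installation:

```python
import os
import sys

# Point both the driver (this process) and the Python workers at the
# same interpreter. Spark reads these variables when it launches a
# Python worker, so they must be set before getOrCreate() is called.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

# Then build the session as usual, e.g.:
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName("MyApp").getOrCreate()
```

Using `sys.executable` avoids hard-coding a path that may differ between machines; the two variables just need to resolve to interpreters with matching Python versions.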