I'm trying to create a DataFrame using the createDataFrame method, and I get an error for the following code:
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()

person = spark.createDataFrame([
    (0, "AA", 0),
    (1, "BB", 1),
    (2, "CC", 1)
], schema=["id", "name", "graduate"])

person.take(6)
This is the error I got:
Py4JJavaError: An error occurred while calling o43.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (QUASAR executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:203)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)
at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:174)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:67)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
But when I read the data from a CSV or any other file, everything works fine. Here is a working example of that:
flightData2015 = spark.read \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .csv("flight-data\\csv\\2015-summary.csv")

flightData2015.take(5)
I'm not sure why I only get an error when I try to print the DataFrame created using the createDataFrame method.
It turns out I had missed setting the following environment variables:
PYSPARK_PYTHON=path\python
PYSPARK_DRIVER_PYTHON=path\python
After setting the above variables in the environment, everything works fine.
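For reference, the same fix can also be applied from inside the script by setting the variables before the SparkSession is created. This is only a minimal sketch: it uses sys.executable so that both the driver and the workers point at the interpreter running the script; substitute an explicit path if your workers should use a different Python.

import os
import sys

# Make the Python workers use the same interpreter as the driver script.
# These must be set before the SparkSession (and its JVM) is started.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()

person = spark.createDataFrame([
    (0, "AA", 0),
    (1, "BB", 1),
    (2, "CC", 1)
], schema=["id", "name", "graduate"])

print(person.take(3))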
But I still wonder why only the DataFrame created using spark.createDataFrame() failed, while the one created using spark.read() worked when the above environment variables were missing. Please let me know.