Tags: pyspark, py4j

PySpark's Py4J Error: Why Does One Script Work While the Other Fails?


I have installed PySpark on my laptop. When I run the following program, everything works fine:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('pyspark').getOrCreate()
book_local = spark.read.text("data.txt")
book_local.show()

However, when I run the following program, an error is thrown:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('pyspark').getOrCreate()

my_grocery_list = [
    ["Banana", 2, 1.74],
    ["Apple", 4, 2.04],
    ["Carrot", 1, 1.09],
    ["Cake", 1, 10.99],
]
df_grocery_list = spark.createDataFrame(my_grocery_list)
df_grocery_list.show()   # This is where the error is thrown

The error message is:

Py4JJavaError: java.io.IOException: Cannot run program "python3"

After setting the following environment variables, everything worked again:

import os
import sys
from pyspark.sql import SparkSession

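# Point both the driver and the worker processes at the current interpreter;
# set these before the SparkSession is created so the launched JVM inherits them.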
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

My question is: why does the first program run without issues while the second one throws a Py4J error? Does the first program not use the Py4J package at all?

Additionally, when I attempted to replace the environment variable configuration with the following code:

spark = SparkSession.builder.appName('pyspark').config("spark.pyspark.python", sys.executable).getOrCreate()

I still encountered an error.


Solution

  • The difference in behavior between your two scripts comes down to which operations need Python worker processes and how PySpark locates the Python interpreter by default.

    The first script works because reading a text file and showing the resulting DataFrame are executed entirely inside the JVM; no Python worker processes are required.

    The second script fails because the DataFrame is built from a local Python list, so evaluating it with show() requires Python worker processes, and without proper configuration PySpark cannot locate a python3 executable to launch them.
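
    A quick way to see the distinction (a minimal sketch using your data.txt; the upper-casing UDF is a hypothetical example, not from your question): built-in column functions run inside the JVM, while a Python UDF forces the JVM to launch python3 worker processes, which is exactly the step that fails without the configuration.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import length, udf

    spark = SparkSession.builder.appName('pyspark').getOrCreate()
    book_local = spark.read.text("data.txt")

    # JVM-only: length() is a built-in function, so this works even when
    # PySpark cannot find a python3 executable.
    book_local.select(length("value")).show()

    # Python UDF: the JVM must start a python3 worker to run the lambda, so this
    # raises the same Py4JJavaError until the interpreter is configured.
    shout = udf(lambda s: s.upper())
    book_local.select(shout("value")).show()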

    Why doesn't .config("spark.pyspark.python", sys.executable) work on its own?

    This setting tells Spark which executable to use for the worker processes, but it does not affect the driver side. If the driver still relies on PYSPARK_PYTHON or PYSPARK_DRIVER_PYTHON, the session can still fail unless those environment variables are set. To configure both the driver and the workers through the builder, set both spark.pyspark.python and spark.pyspark.driver.python:

    spark = SparkSession.builder \
        .appName('pyspark') \
        .config("spark.pyspark.python", sys.executable) \
        .config("spark.pyspark.driver.python", sys.executable) \
        .getOrCreate()
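
    Putting it together, one way to verify that the setting took effect (an illustrative check, not part of the original answer) is to have a worker task report which interpreter it runs under. Note that getOrCreate() reuses any SparkSession that already exists in the process, so these builder settings only apply in a fresh session:

    import sys
    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName('pyspark') \
        .config("spark.pyspark.python", sys.executable) \
        .config("spark.pyspark.driver.python", sys.executable) \
        .getOrCreate()

    # Each worker task runs under the interpreter Spark launched for it, so this
    # should print the same path as sys.executable on the driver.
    print(spark.sparkContext.parallelize([0]).map(lambda _: sys.executable).collect())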