python windows pyspark

Pyenv - Switching between Python and PySpark versions without hardcoding environment variable paths for Python


I have trouble getting different versions of PySpark to work correctly on my Windows machine in combination with different versions of Python installed via pyenv.

The setup:

  1. I installed pyenv and let it set the environment variables (PYENV, PYENV_HOME, PYENV_ROOT and the entry in PATH)
  2. I installed the Amazon Corretto Java JDK (jdk1.8.0_412) and set the JAVA_HOME environment variable.
  3. I downloaded the winutils.exe & hadoop.dll from here and set the HADOOP_HOME environment variable.
  4. Via pyenv I installed Python 3.10.10 and then pyspark 3.4.1
  5. Via pyenv I installed Python 3.8.10 and then pyspark 3.2.1

Python works as expected:

But I'm having trouble with PySpark.

For one, I cannot start PySpark from the PowerShell console by running pyspark. I get: The term 'pyspark' is not recognized as the name of a cmdlet, function, script file.....
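The error suggests that PowerShell cannot find a pyspark launcher on PATH. The following is a minimal diagnostic sketch, not a fix: it reports which interpreter is running, where that interpreter's scripts directory is, and whether a launcher with the given name is visible on PATH. The helper name describe_launcher is hypothetical, and the sketch assumes pip placed the launcher in the interpreter's scripts directory, which I have not verified for this setup.

```python
import shutil
import sys
import sysconfig
from pathlib import Path


def describe_launcher(name="pyspark"):
    """Report where a console launcher could live for the running interpreter."""
    # Directory where pip normally installs console launchers for this interpreter.
    scripts_dir = Path(sysconfig.get_path("scripts"))
    # What the shell would resolve for this name via PATH (None if nothing is found).
    found_on_path = shutil.which(name)
    # Launcher files actually present next to this interpreter (e.g. name, name.cmd).
    in_scripts_dir = (
        sorted(p.name for p in scripts_dir.glob(name + "*"))
        if scripts_dir.is_dir()
        else []
    )
    return {
        "interpreter": sys.executable,
        "scripts_dir": str(scripts_dir),
        "found_on_path": found_on_path,
        "in_scripts_dir": in_scripts_dir,
    }


if __name__ == "__main__":
    for key, value in describe_launcher().items():
        print(f"{key}: {value}")
```

Running this once per Python version (for example with pyenv exec python) shows whether the launcher exists for that version but is simply not reachable through PATH.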

More annoyingly, my repo-scripts (with a .venv created via pyenv & poetry) also fail:

However, both work after I add the following two entries to the PATH environment variable:

but then I would have to "hardcode" the Python version, which is exactly what I don't want to do while using pyenv.

If I hardcode the path and then switch to another Python version (pyenv global 3.8.10), running pyspark in PowerShell still starts PySpark 3.4.1 from the PATH entry for Python 3.10.10. Likewise, python on the command line always resolves to the hardcoded version, no matter what I do with pyenv.

I was hoping to be able to start PySpark 3.2.1 from Python 3.8.10 which I just "activated" with pyenv globally.

What do I have to do to be able to switch between the Python installations (and thus also between PySparks) with pyenv without "hardcoding" the Python paths?

Example PySpark script:

from pyspark.sql import SparkSession
spark = (
    SparkSession
    .builder
    .master("local[*]")
    .appName("myapp")
    .getOrCreate()
)
data = [("Finance", 10),
        ("Marketing", 20),
        ]
df = spark.createDataFrame(data=data)
df.show(10, False)

Solution

  • I "solved" the issue by completely removing the Python path from the PATH environment variable and doing everything exclusively via pyenv. I suppose my original task is not possible.

    I can still start a Python process by running pyenv exec python in the terminal.

    Disappointingly, though, I can no longer launch a Spark process from the terminal.

    At least my repositories work as expected when setting the pyenv versions (pyenv local 3.8.10 / pyenv global 3.10.10).
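One way to keep Spark on whichever interpreter pyenv has selected, without writing a path into PATH, might be to set PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON from sys.executable before the session is created. This is a sketch under that assumption, not something I have confirmed on the setup above; the helper names pin_spark_to_current_interpreter and build_session are hypothetical.

```python
import os
import sys


def pin_spark_to_current_interpreter(env=None):
    """Point PySpark's driver and workers at the interpreter running this script."""
    env = os.environ if env is None else env
    # Both variables are read by PySpark to choose the Python executable.
    env["PYSPARK_PYTHON"] = sys.executable
    env["PYSPARK_DRIVER_PYTHON"] = sys.executable
    return env


def build_session(app_name="myapp"):
    """Create a local SparkSession after pinning the interpreter."""
    # Imported lazily so the helper above can be used where pyspark is not installed.
    from pyspark.sql import SparkSession

    pin_spark_to_current_interpreter()
    return (
        SparkSession.builder
        .master("local[*]")
        .appName(app_name)
        .getOrCreate()
    )
```

If the script is started with pyenv exec python, sys.executable is the interpreter chosen by pyenv local or pyenv global, so the PySpark installed for that version is the one that gets imported.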