I have trouble getting different versions of PySpark to work correctly on my Windows machine in combination with different Python versions installed via pyenv.
The setup:
Python works as expected:
pyenv global <version>
python --version
in PowerShell always shows the version that I set before with pyenv. But I'm having trouble with PySpark.
For one, I cannot start PySpark from the PowerShell console by running pyspark:
>>> The term 'pyspark' is not recognized as the name of a cmdlet, function, script file....
More annoyingly, my repo scripts (with a .venv created via pyenv & poetry) also fail:
Caused by: java.io.IOException: Cannot run program "python3": CreateProcess error=2, The system cannot find the file specified
[...] Caused by: java.io.IOException: CreateProcess error=2, The system cannot find the file specified
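As an aside, as far as I understand, Spark falls back to launching python3 for its worker processes when PYSPARK_PYTHON is not set, and Windows normally has no python3.exe, which would explain the second error. A minimal sketch of what I mean, pinning the workers to whatever interpreter runs the script (the .venv one in my case):

import os
import sys

from pyspark.sql import SparkSession

# Pin both the driver and the workers to the interpreter executing this
# script, so Spark never has to look up "python3" on PATH.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

spark = SparkSession.builder.master("local[*]").appName("myapp").getOrCreate()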
However, both work after I add two version-specific Python entries to the PATH environment variable. But then I would have to "hardcode" the Python version, which is exactly what I don't want to do while using pyenv.
If I hardcode the path, then even after switching to another Python version (pyenv global 3.8.10), running pyspark in PowerShell still starts PySpark 3.4.1 from the PATH entry for Python 3.10.10. I also cannot do anything with python on the command line, since it always points to the hardcoded Python version, no matter what I do with pyenv.
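A quick way to check which interpreter actually wins on PATH:

import sys

# Prints the full path of the interpreter that was resolved; with the
# hardcoded PATH entry in place this is always the 3.10.10 install,
# regardless of the pyenv setting.
print(sys.executable)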
I was hoping to be able to start PySpark 3.2.1 from Python 3.8.10, which I had just "activated" globally with pyenv.
What do I have to do to be able to switch between Python installations (and thus also between PySpark versions) with pyenv, without "hardcoding" the Python paths?
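One direction I sketched, but have not verified: resolve the active interpreter at runtime and hand it to Spark, so that no version-specific path has to live in PATH. This assumes that pyenv which python also works under pyenv-win and prints the full path of the currently selected interpreter:

import os
import subprocess

# Ask pyenv for the interpreter of the currently active version
# (assumption: "pyenv which python" behaves this way on pyenv-win too).
active_python = subprocess.run(
    "pyenv which python",
    capture_output=True,
    text=True,
    shell=True,
    check=True,
).stdout.strip()

# Point Spark's driver and workers at that interpreter.
os.environ["PYSPARK_PYTHON"] = active_python
os.environ["PYSPARK_DRIVER_PYTHON"] = active_python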
Example PySpark script:
from pyspark.sql import SparkSession

spark = (
    SparkSession
    .builder
    .master("local[*]")
    .appName("myapp")
    .getOrCreate()
)

data = [
    ("Finance", 10),
    ("Marketing", 20),
]

df = spark.createDataFrame(data=data)
df.show(10, False)
I "solved" the issue by completely removing the Python path from the PATH
environment variable and doing everything exclusively via pyenv. I suppose my original task is not possible.
I can still start a Python process by running pyenv exec python
in the terminal.
Disappointingly, though, I cannot launch a Spark process from the terminal anymore.
At least my repositories work as expected when setting the pyenv versions (pyenv local 3.8.10 / pyenv global 3.10.10).
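That said, a Spark session can still be started through the pyenv-managed interpreter itself. A rough stand-in for the pyspark launcher (the file name spark_shell.py is just an example, and it assumes pyspark is pip-installed in the active pyenv version):

# spark_shell.py -- start with: pyenv exec python -i spark_shell.py
from pyspark.sql import SparkSession

# Build the same kind of local session the pyspark launcher would create.
spark = SparkSession.builder.master("local[*]").appName("shell").getOrCreate()
sc = spark.sparkContext
print(f"Spark {spark.version} -- session available as 'spark', context as 'sc'.")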