I was following this article to package the fuzzy-c-means lib so it can run on a Spark cluster; I'm using the bitnami/spark image on Docker. I used a Python image to build a venv with Python 3.7 and installed the fuzzy-c-means lib in it, then used venv-pack to compress the venv into an environment.tar.gz file.
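Roughly, the packaging steps inside the Python container looked like this (a sketch; the venv name and output file name are just what I used):
# inside a python:3.7 container
python -m venv venv
source ./venv/bin/activate
pip install --upgrade pip
pip install fuzzy-c-means venv-pack
# pack the active venv into the archive spark-submit will ship
venv-pack -o environment.tar.gz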
I have an app.py file:
from pyspark.sql import SparkSession

def main(spark):
    # just check that the lib from the packed venv can be imported
    import fcmeans
    print('-')

if __name__ == "__main__":
    print('log')
    spark = (
        SparkSession.builder
        .getOrCreate()
    )
    main(spark)
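For debugging, a quick check like this (a diagnostic sketch, not part of my app) shows which interpreter the driver and a worker actually end up on:
def check_python(spark):
    import sys
    # interpreter running the driver process
    print('driver python:', sys.executable)

    def worker_python(_):
        import sys
        return sys.executable

    # interpreter running on one executor
    print('executor python:',
          spark.sparkContext.parallelize([0], 1).map(worker_python).collect())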
So when I run my spark-submit command I get the error: Exception in thread "main" java.io.IOException: Cannot run program "./environment/bin/python": error=2, No such file or directory
spark-submit command:
PYSPARK_PYTHON=./environment/bin/python spark-submit --archives ./environment.tar.gz#environment ./app.py
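For reference, the example in Spark's Python package management docs also sets PYSPARK_DRIVER_PYTHON to a local interpreter, so only the executors resolve the unpacked archive; following that pattern the command would look something like:
PYSPARK_DRIVER_PYTHON=python \
PYSPARK_PYTHON=./environment/bin/python \
spark-submit --archives ./environment.tar.gz#environment ./app.py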
I can run app.py with the .tar.gz file if I remove the PYSPARK_PYTHON variable, but then I get "no module named 'fcmeans'" for the import in my app.py.
The thing is, when I run with --archives ./environment.tar.gz#environment, Spark unpacks the tar.gz into /tmp/spark-uuid-code/userFiles-uuid-code/environment/. And when I set PYSPARK_PYTHON, it does not recognize that path as a valid file, even though it seems Spark should manage this. Any hints on what I should do?
I've managed to make it work by creating the virtualenv inside the EMR cluster, then exporting the .tar.gz file with venv-pack to an S3 bucket. This article helped: gist.github.
Inside the EMR shell:
# Create and activate our virtual environment
virtualenv -p python3 venv-datapeeps
source ./venv-datapeeps/bin/activate
# Upgrade pip and install a couple libraries
pip3 install --upgrade pip
pip3 install fuzzy-c-means boto3 venv-pack
# Package the environment and upload
venv-pack -o pyspark_venv.tar.gz
aws s3 cp pyspark_venv.tar.gz s3://<BUCKET>/artifacts/pyspark/
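With the archive in S3, the job can then be submitted so that YARN distributes and unpacks it for both the driver (in cluster deploy mode) and the executors. A sketch of the submit command (the #environment alias and the bucket path are placeholders; the two PYSPARK_PYTHON confs come from the Spark-on-YARN configuration docs):
spark-submit \
  --deploy-mode cluster \
  --archives s3://<BUCKET>/artifacts/pyspark/pyspark_venv.tar.gz#environment \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
  --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python \
  app.py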