When I try to run a PySpark step on my EMR cluster, it fails with:
Caused by: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
My understanding from the AWS documentation is that EMRFS should already be installed on an EMR cluster, so why is the class missing? I also tried referencing my .py file in S3 with an s3a:// URI instead, and got a similar error saying the S3A filesystem class can't be found.
Here's how I'm creating the EMR step:
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
--steps 'Type=spark,Name=Bronze,Args=[ --deploy-mode,cluster, --master,yarn, --conf,spark.yarn.submit.waitAppCompletion=true,s3://my-bucket/spark-scripts/spark_streaming.py],ActionOnFailure=CONTINUE'
And my cluster's bootstrap script is:
#!/bin/bash
# Download Delta Lake and supporting JARs (SQS SDK, S3 connector, Postgres JDBC) into Spark's jar directory
sudo curl -O --output-dir /usr/lib/spark/jars/ https://repo1.maven.org/maven2/io/delta/delta-spark_2.12/3.2.1/delta-spark_2.12-3.2.1.jar
sudo curl -O --output-dir /usr/lib/spark/jars/ https://repo1.maven.org/maven2/io/delta/delta-storage/3.2.1/delta-storage-3.2.1.jar
sudo curl -O --output-dir /usr/lib/spark/jars/ https://repo1.maven.org/maven2/software/amazon/awssdk/sqs/2.29.6/sqs-2.29.6.jar
sudo curl -O --output-dir /usr/lib/spark/jars/ https://awslabs-code-us-east-1.s3.amazonaws.com/spark-streaming-sql-s3-connector/spark-streaming-sql-s3-connector-0.0.1.jar
sudo curl -O --output-dir /usr/lib/spark/jars/ https://jdbc.postgresql.org/download/postgresql-42.7.4.jar
# Python-side dependencies
sudo python3 -m pip install delta-spark==3.2.1
sudo python3 -m pip install boto3
I resolved this by removing this line from the bootstrap:
sudo python3 -m pip install delta-spark==3.2.1
As mentioned in similar questions, overwriting EMR's Spark installation causes this error. The delta-spark package pulls in pyspark as a dependency, so that pip install was unintentionally putting a plain open-source Spark in front of EMR's build, which is the one that ships the EmrFileSystem and S3A classes.
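The rest of the bootstrap stays as it was; for reference, the working version (a sketch of the same script with only that one line removed) looks like this:
#!/bin/bash
# Same JAR downloads into Spark's jar directory as before
sudo curl -O --output-dir /usr/lib/spark/jars/ https://repo1.maven.org/maven2/io/delta/delta-spark_2.12/3.2.1/delta-spark_2.12-3.2.1.jar
sudo curl -O --output-dir /usr/lib/spark/jars/ https://repo1.maven.org/maven2/io/delta/delta-storage/3.2.1/delta-storage-3.2.1.jar
sudo curl -O --output-dir /usr/lib/spark/jars/ https://repo1.maven.org/maven2/software/amazon/awssdk/sqs/2.29.6/sqs-2.29.6.jar
sudo curl -O --output-dir /usr/lib/spark/jars/ https://awslabs-code-us-east-1.s3.amazonaws.com/spark-streaming-sql-s3-connector/spark-streaming-sql-s3-connector-0.0.1.jar
sudo curl -O --output-dir /usr/lib/spark/jars/ https://jdbc.postgresql.org/download/postgresql-42.7.4.jar
# Keep boto3, but do not pip install delta-spark: it pulls in its own pyspark and shadows EMR's Spark
sudo python3 -m pip install boto3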
So now, to use the Delta library in Jupyter I load it with a magic command, and for EMR steps I reference the Delta JARs through --conf arguments.
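For example (a sketch rather than my exact cells and commands; the Maven coordinate and the two spark.sql.* settings are the standard ones for Delta 3.2.1, adjust versions to match yours), the Jupyter side can be handled with a Sparkmagic %%configure cell at the top of the notebook:
%%configure -f
{
    "conf": {
        "spark.jars.packages": "io.delta:delta-spark_2.12:3.2.1",
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog"
    }
}
And the step can carry the same settings as --conf arguments placed before the script path:
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
--steps 'Type=Spark,Name=Bronze,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.jars.packages=io.delta:delta-spark_2.12:3.2.1,--conf,spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension,--conf,spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog,s3://my-bucket/spark-scripts/spark_streaming.py]'
Using the Maven coordinate (spark.jars.packages) rather than spark.jars avoids comma escaping in the CLI shorthand; and since the bootstrap already drops the Delta JARs into /usr/lib/spark/jars/ on every node, the two spark.sql.* settings alone may be enough.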