I'm submitting a Spark driver program to an EMR cluster, and it needs to use a jar that I uploaded, so my code was like:
```python
import boto3

boto3.client("emr-containers").start_job_run(
    name=job_name,
    virtualClusterId=self.virtual_cluster_id,
    releaseLabel="emr-6.11.0-latest",
    executionRoleArn=role,
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": entry_point,
            "entryPointArguments": entry_point_args,
            "sparkSubmitParameters": "--driver-class-path s3://my_bucket/mysql-connector-j-8.0.32.jar --jars s3://my_bucket/mysql-connector-j-8.0.32.jar --conf spark.kubernetes.driver.podTemplateFile=my_file.yaml --conf spark.kubernetes.executor.podTemplateFile=my_file.yaml",
        }
    },
)
```
But this caused EMR to throw an exception:

```
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
```
According to the documentation and some other Stack Overflow posts, I suspect this is because my `--driver-class-path` and `--jars` arguments overrode EMR's default classpath, which is where `EmrFileSystem` (the EMRFS class EMR uses to read from S3) normally comes from. So what arguments should I pass to EMR in order to use my own jar and avoid the problem above?
On EMR Serverless you can add jars in `sparkSubmitParameters` like this: `--conf spark.jars=s3://my-bucket/multiple-jars/*`. I suspect it should be similar for EMR on EKS.
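As a minimal sketch (untested on EMR on EKS, and reusing the placeholder names from the question), the idea is to move the jar into `spark.jars`, which appends to the classpath rather than replacing it, and drop `--driver-class-path` entirely so EMR's default jars stay in place:

```python
import boto3

# Assumption: spark.jars appends the extra jar to the driver/executor
# classpath instead of overwriting EMR's defaults, so EMRFS classes
# such as EmrFileSystem remain resolvable.
spark_submit_params = (
    "--conf spark.jars=s3://my_bucket/mysql-connector-j-8.0.32.jar "
    "--conf spark.kubernetes.driver.podTemplateFile=my_file.yaml "
    "--conf spark.kubernetes.executor.podTemplateFile=my_file.yaml"
)

boto3.client("emr-containers").start_job_run(
    name=job_name,                        # defined elsewhere in the question
    virtualClusterId=virtual_cluster_id,  # self.virtual_cluster_id in the question
    releaseLabel="emr-6.11.0-latest",
    executionRoleArn=role,
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": entry_point,
            "entryPointArguments": entry_point_args,
            "sparkSubmitParameters": spark_submit_params,
        }
    },
)
```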