I created an EMR cluster on AWS with Spark and Livy. I submitted a custom JAR with some additional libraries (e.g. datasources for custom formats) as a custom JAR step. However, the classes from the custom JAR are not available when I try to access them from Livy.
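For reference, I submitted the step roughly like this (cluster ID, bucket, and JAR name are placeholders):

# Hypothetical sketch of the custom JAR step submission via the AWS CLI;
# cluster ID and S3 paths are placeholders.
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=CUSTOM_JAR,Name="Install mylib",ActionOnFailure=CONTINUE,Jar=s3://mybucket/mylib.jar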
What do I have to do to make the custom stuff available in the environment?
I am posting this as an answer to be able to accept it - I figured it out thanks to Yuval Itzchakov's comments and the AWS documentation on Custom Bootstrap Actions.
So here is what I did:
1. Built a fat JAR with sbt assembly and uploaded it (as mylib.jar, containing everything needed) into an S3 bucket.
2. Created a script named copylib.sh which contains the following:
#!/bin/bash
# Copy the fat JAR from S3 onto the node so Spark can load it
# from the local filesystem via the extraClassPath settings below.
mkdir -p /home/hadoop/mylib
aws s3 cp s3://mybucket/mylib.jar /home/hadoop/mylib
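For completeness, the build and upload in step 1 might look roughly like this (a sketch; the assembly output path and bucket name are assumptions):

# Build the fat JAR and upload it together with the bootstrap script;
# the target path depends on your Scala version and build config.
sbt assembly
aws s3 cp target/scala-2.11/mylib.jar s3://mybucket/mylib.jar
aws s3 cp copylib.sh s3://mybucket/copylib.sh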
3. Created the following configuration JSON and put it into the same bucket alongside mylib.jar and copylib.sh:
[
  {
    "Classification": "spark-env",
    "Properties": {},
    "Configurations": [{
      "Classification": "export",
      "Properties": {
        "PYSPARK_PYTHON": "/usr/bin/python3"
      }
    }]
  },
  {
    "Classification": "yarn-env",
    "Properties": {},
    "Configurations": [{
      "Classification": "export",
      "Properties": {
        "PYSPARK_PYTHON": "/usr/bin/python3"
      }
    }]
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executor.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/mylib/mylib.jar",
      "spark.driver.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/mylib/mylib.jar"
    }
  }
]
The classifications for spark-env and yarn-env are needed for PySpark to work with Python 3 on EMR through Livy. And there is another catch: EMR already populates both extraClassPath properties with a lot of libraries which are needed for EMR to function properly, so I had to run a cluster without my lib, extract these settings from spark-defaults.conf, and adjust my classification afterwards. Otherwise, things like S3 access wouldn't work.
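Capturing those defaults can be as simple as this on the master node of a cluster launched without the custom classification (a sketch; /etc/spark/conf is the standard Spark config location on EMR):

# Print the classpath defaults EMR generated, then merge them into the
# spark-defaults classification above before appending the custom JAR.
grep extraClassPath /etc/spark/conf/spark-defaults.conf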
When creating the cluster, in Step 1 I referenced the configuration JSON file from above in Edit software settings, and in Step 3, I configured copylib.sh as a Custom Bootstrap Action.
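The CLI equivalent of that console setup might look roughly like this (a sketch; cluster name, release label, and instance settings are placeholders):

# Hypothetical create-cluster call; --configurations points at the JSON in S3
# and --bootstrap-actions runs copylib.sh on every node at startup.
aws emr create-cluster \
  --name "my-cluster" \
  --release-label emr-5.30.0 \
  --applications Name=Spark Name=Livy Name=JupyterHub \
  --configurations https://s3.amazonaws.com/mybucket/config.json \
  --bootstrap-actions Path=s3://mybucket/copylib.sh,Name="Copy mylib" \
  --instance-type m5.xlarge --instance-count 3 \
  --use-default-roles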
I can now open the JupyterHub of the cluster, start a notebook, and work with my added functions.
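To check that the JAR is on the classpath through Livy itself, something like this should work (a sketch; 8998 is Livy's default port and the class name is hypothetical):

# Create a Spark session via Livy's REST API.
curl -s -X POST -H 'Content-Type: application/json' \
  -d '{"kind": "spark"}' http://localhost:8998/sessions
# Once the session is idle, run a statement that loads a class from mylib.jar
# (com.example.MyDataSource is a hypothetical class name).
curl -s -X POST -H 'Content-Type: application/json' \
  -d '{"code": "Class.forName(\"com.example.MyDataSource\")"}' \
  http://localhost:8998/sessions/0/statements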