apache-spark, pyspark, amazon-emr

How to set environment variables in AWS EMR from Secrets Manager so they can be used by PySpark scripts


I am using emr-6.12.0 and trying to set environment variables, which are stored in AWS Secrets Manager, from my bootstrap.sh file.

SECRET_NAME="/myapp/dev/secrets"
# Fetch the whole secret (a JSON object) as a string
SECRETS_JSON=$(aws secretsmanager get-secret-value --secret-id "$SECRET_NAME" --query SecretString --output text)

# Parse the secrets and export each key/value pair as an environment variable
for key in $(echo "$SECRETS_JSON" | jq -r "keys[]"); do
  value=$(echo "$SECRETS_JSON" | jq -r ".$key // empty" | sed 's/"/\\"/g')
  echo "$value"
  if [ -n "$value" ]; then
    export "$key"="$value"
  fi
done

I am able to see these values in the bootstrap logs.

But when I try to access these variables from my PySpark script, they are not set.

os.environ.get("POSTGRES_URL")  # returns None

for key, value in os.environ.items():
    self.logger.info(f"{key}: {value}")  # my env variables are not listed

As I am new to EMR and Spark, please help me understand how to get my environment variables from Secrets Manager into EMR.


Solution

  • Environment variables exported in a bootstrap action only exist in the shell that runs bootstrap.sh; they are not inherited by the YARN containers that later execute your Spark job, which is why your PySpark script cannot see them. A more reliable approach is to read the secrets from Python at runtime. To retrieve secrets from AWS Secrets Manager in your Python application, follow these steps:

    pip install aws-secretsmanager-caching 
    

    After installing the package on the cluster (for example from your bootstrap.sh), in your app.py you'll have something like this:

    import botocore
    import botocore.session
    from aws_secretsmanager_caching import SecretCache, SecretCacheConfig

    # Create a Secrets Manager client and wrap it in an in-memory cache
    client = botocore.session.get_session().create_client('secretsmanager')
    cache_config = SecretCacheConfig()
    cache = SecretCache(config=cache_config, client=client)

    # Fetch (and cache) the secret string by name
    secret = cache.get_secret_string('mysecret')
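
    Since the secret in the question holds a JSON object with several keys, you would then parse the returned string and read the individual values. A minimal sketch building on the cache above, assuming the secret name /myapp/dev/secrets and the POSTGRES_URL key from the question:

    import json
    import os

    # The secret string is a JSON object, so parse it into a dict
    secrets = json.loads(cache.get_secret_string('/myapp/dev/secrets'))

    # Read a single value directly ...
    postgres_url = secrets.get('POSTGRES_URL')

    # ... or populate os.environ if the rest of the code expects
    # environment variables (this only affects the current process)
    for key, value in secrets.items():
        os.environ[key] = str(value)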
    

    NB: the IAM role used by your EMR instances (the EC2 instance profile) must have the required secretsmanager:GetSecretValue permission on the secret; see the official AWS Secrets Manager documentation for details.
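
    For reference, a minimal IAM policy sketch granting that permission; the region, account ID, and secret ARN are placeholders to replace with your own:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": "secretsmanager:GetSecretValue",
          "Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:/myapp/dev/secrets-*"
        }
      ]
    }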