I have an EMR Serverless PySpark job that I launch from a Step Function. I am trying to pass arguments to SparkSubmit via entryPointArguments, in the form of variables set at the beginning of the Step Function (e.g. today_date, source, tuned_parameters), which I then use in the PySpark code.
I was able to find a partial solution in this post; however, I am trying to pass variables from the Step Function, not a hardcoded argument (i.e. "prd").
"JobDriver": {
"SparkSubmit": {
"EntryPoint": "s3://xxxx-my-code/test/my_code_edited_3.py",
"EntryPointArguments": ["-env", "prd", "-source.$", "$.source"]
}
}
Using argparse I am able to read the first argument, "-env", which successfully returns "prd"; however, I am having trouble figuring out how to pass a variable for the source argument.
I managed to find an answer to this question. Passing variable arguments to EMR Serverless SparkSubmit is achieved with Amazon States Language (ASL) intrinsic functions.
Provided that the JSON input to the Step Function is:

{
    "source": "mysource123"
}
The correct way to pass this variable argument in EntryPointArguments is:
"EntryPointArguments.$": "States.Array('-source', $.source)"
Then, using argparse, one can read this variable in the PySpark job on EMR Serverless:
import argparse

# The flag name here must match the one passed in EntryPointArguments.
parser = argparse.ArgumentParser()
parser.add_argument("-source")
args = parser.parse_args()
print(args.source)
The result of the print statement is mysource123.
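For completeness, here is a minimal sketch of reading several such arguments at once, assuming the extended EntryPointArguments shown earlier (with -env, -source, and -today_date all passed):

import argparse

# Flag names must match the ones listed in EntryPointArguments.
# argparse accepts single-dash long options and stores each value
# under the flag name without the dash (args.env, args.source, ...).
parser = argparse.ArgumentParser()
parser.add_argument("-env")
parser.add_argument("-source")
parser.add_argument("-today_date")
args = parser.parse_args()

print(args.env, args.source, args.today_date)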