amazon-web-services, pyspark, aws-step-functions, emr-serverless

How to pass EMR Serverless PySpark entryPointArguments as variable


I have an EMR Serverless PySpark job that I launch from a Step Function. I am trying to pass arguments to SparkSubmit through entryPointArguments, using variables set at the beginning of the Step Function (e.g. today_date, source, tuned_parameters), which I then use in the PySpark code.

I was able to find a partial solution in this post; however, I am trying to pass variables from the Step Function rather than a hardcoded argument such as "prd".

        "JobDriver": {
          "SparkSubmit": {
            "EntryPoint": "s3://xxxx-my-code/test/my_code_edited_3.py",
            "EntryPointArguments": ["-env", "prd", "-source.$", "$.source"]
          }
        }

Using argparse I am able to read the first argument, "-env", and it successfully returns "prd"; however, I am having trouble figuring out how to pass a variable for the source argument.


Solution

  • Managed to find an answer for this question. Passing variable arguments to EMR Serverless SparkSubmit is achieved with Amazon States Language intrinsic functions.

    Provided that the JSON input to the Step Function is:

        {
          "source": "mysource123"
        }
    

    The correct way to pass this variable argument in EntryPointArguments is:

    "EntryPointArguments.$": "States.Array('-source', $.source)"
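
    Static and variable arguments can be mixed in a single States.Array call. A sketch of the full JobDriver, assuming the "-env prd" flag from the question is still wanted:

        "JobDriver": {
          "SparkSubmit": {
            "EntryPoint": "s3://xxxx-my-code/test/my_code_edited_3.py",
            "EntryPointArguments.$": "States.Array('-env', 'prd', '-source', $.source)"
          }
        }

    Note the ".$" suffix on the key, which tells Step Functions to evaluate the value as a JSONPath/intrinsic-function expression rather than a literal string.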
    

    Then, using argparse, the variable can be read in the PySpark job on EMR Serverless:

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("-source")  # matches the '-source' flag passed via States.Array
    args = parser.parse_args()
    print(args.source)
    

    The result of the print statement is mysource123.
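
    The argument mapping can be checked locally before deploying, because parse_args accepts an explicit argv list. A minimal sketch, simulating the arguments EMR Serverless would receive for the Step Function input above:

```python
import argparse

# Simulate the argv that States.Array('-source', $.source) produces
# for the input {"source": "mysource123"}.
parser = argparse.ArgumentParser()
parser.add_argument("-source")

# Passing the list explicitly instead of reading sys.argv lets the
# mapping be verified without launching a job.
args = parser.parse_args(["-source", "mysource123"])
print(args.source)  # -> mysource123
```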