google-cloud-platform, pyspark, airflow, google-cloud-dataproc

Submit command line arguments to a pyspark job on airflow


I have a PySpark job on GCP Dataproc that is triggered from Airflow, as shown below:

config = help.loadJSON("batch/config_file")

MY_PYSPARK_JOB = {
    "reference": {"project_id": "my_project_id"},
    "placement": {"cluster_name": "my_cluster_name"},
    "pyspark_job": {
        "main_python_file_uri": "gs://file/loc/my_spark_file.py"]
        "properties": config["spark_properties"]
        "args": <TO_BE_ADDED>
    },
}

I need to supply command line arguments to this PySpark job as shown below [this is how I am running my PySpark job from the command line]:

spark-submit gs://file/loc/my_spark_file.py --arg1 val1 --arg2 val2

I am providing the arguments to my PySpark job using "configparser". Therefore, arg1 is the key and val1 is the value from my spark-submit command above.
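For illustration only, here is a minimal sketch of how such key/value arguments could be consumed inside my_spark_file.py. The actual script uses configparser, whose wiring isn't shown in the question; argparse is used here purely to demonstrate the key/value behaviour, and the argument names are placeholders.

```python
# Hypothetical sketch (not the actual script): how my_spark_file.py might
# receive "--arg1 val1 --arg2 val2". argparse accepts both the
# "--arg1 val1" and the "--arg1=val1" forms.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--arg1")   # key "arg1" -> value "val1"
parser.add_argument("--arg2")   # key "arg2" -> value "val2"
args = parser.parse_args()

print(args.arg1, args.arg2)
```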

How do I define the "args" param in the "MY_PYSPARK_JOB" defined above [equivalent to my command line arguments]?


Solution

  • I finally managed to solve this conundrum. If we are making use of ConfigParser, the key has to be specified as below [irrespective of whether the argument is passed on the command line or via Airflow]:

    --arg1
    

    In Airflow, the args are passed as a Sequence[str] (as mentioned by @Betjens below) and each argument is defined as follows:

    arg1=val1
    

    Therefore, as per my requirement, the command line arguments are defined as depicted below (a complete DAG sketch using these args follows at the end of this answer):

    "args": ["--arg1=val1",
        "--arg2=val2"]
    

    PS: Thank you @Betjens for all your suggestions.
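For completeness, here is a minimal sketch of how the finished job definition might be submitted from an Airflow DAG. It assumes the DataprocSubmitJobOperator from the apache-airflow-providers-google package; the project, region, cluster name, and DAG settings below are placeholders, and exact parameter names may vary slightly with your Airflow/provider version.

```python
# Minimal DAG sketch; assumes apache-airflow-providers-google is installed.
# Project, region, cluster, and GCS paths below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocSubmitJobOperator,
)

MY_PYSPARK_JOB = {
    "reference": {"project_id": "my_project_id"},
    "placement": {"cluster_name": "my_cluster_name"},
    "pyspark_job": {
        "main_python_file_uri": "gs://file/loc/my_spark_file.py",
        # "properties": config["spark_properties"],  # loaded from the JSON config in the question
        "args": ["--arg1=val1", "--arg2=val2"],  # the Sequence[str] described above
    },
}

with DAG(
    dag_id="my_pyspark_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    submit_pyspark = DataprocSubmitJobOperator(
        task_id="submit_pyspark_job",
        job=MY_PYSPARK_JOB,
        region="my-region",          # placeholder
        project_id="my_project_id",  # placeholder
    )
```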