I have a PySpark job on GCP Dataproc that is triggered from Airflow, as shown below:
config = help.loadJSON("batch/config_file")
MY_PYSPARK_JOB = {
    "reference": {"project_id": "my_project_id"},
    "placement": {"cluster_name": "my_cluster_name"},
    "pyspark_job": {
        "main_python_file_uri": "gs://file/loc/my_spark_file.py",
        "properties": config["spark_properties"],
        "args": <TO_BE_ADDED>
    },
}
I need to supply command-line arguments to this PySpark job, as shown below (this is how I run the job from the command line):
spark-submit gs://file/loc/my_spark_file.py --arg1 val1 --arg2 val2
I am providing the arguments to my PySpark job using ConfigParser. Therefore, arg1 is the key and val1 is the value from my spark-submit command above.
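For illustration only, here is a minimal sketch of how --key value (or --key=value) pairs could be collected from sys.argv inside the PySpark script; this is an assumed stand-in for the parsing I actually do, not the exact helper:

import sys

def parse_cli_args(argv):
    """Collect --key value and --key=value pairs into a plain dict.

    Illustrative stand-in only; the real job may parse differently.
    """
    params = {}
    i = 0
    while i < len(argv):
        token = argv[i]
        if token.startswith("--"):
            if "=" in token:
                key, value = token[2:].split("=", 1)
            elif i + 1 < len(argv):
                key, value = token[2:], argv[i + 1]
                i += 1
            else:
                key, value = token[2:], ""
            params[key] = value
        i += 1
    return params

if __name__ == "__main__":
    # e.g. {'arg1': 'val1', 'arg2': 'val2'}
    print(parse_cli_args(sys.argv[1:]))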
How do I define the "args" parameter in MY_PYSPARK_JOB above so that it is equivalent to my command-line arguments?
I finally managed to solve this conundrum. If we are making use of ConfigParser, the key has to be specified as below, irrespective of whether the argument is passed on the command line or via Airflow:
--arg1
In Airflow, the args are passed as a Sequence[str] (as mentioned by @Betjens below), and each argument is defined as follows:
arg1=val1
Therefore, as per my requirement, the command-line arguments are defined as follows:
"args": ["--arg1=val1",
         "--arg2=val2"]
PS: Thank you @Betjens for all your suggestions.