Tags: python-3.x, google-cloud-platform, airflow, google-cloud-dataproc, google-cloud-composer

How to get the jobId of a job submitted through a Dataproc Workflow Template


I submitted a Hive job through a Dataproc Workflow Template, using the Airflow operator DataprocWorkflowTemplateInstantiateInlineOperator from Python. Once the job is submitted, a generated name is assigned as its jobId (example: job0-abc2def65gh12).

Since I could not retrieve the jobId, I tried passing a jobId as a parameter through the REST API, but that does not work.

Can I fetch the jobId or, if that is not possible, can I pass a jobId as a parameter?


Solution

  • The jobId is available in the metadata field of the Operation object returned by the Instantiate operation. See [1] for how to work with the metadata.

    The Airflow operator only polls [2] the Operation and does not return the final Operation object. You could try adding a return statement to the operator's execute method.

    Another option is to use the Dataproc REST API [3] after the workflow finishes. Any labels assigned to the workflow itself are propagated to its clusters and jobs, so you can find the jobs with a list jobs call. For example, the filter parameter could look like: filter = labels.my-label=12345

    [1] https://cloud.google.com/dataproc/docs/concepts/workflows/debugging#using_workflowmetadata

    [2] https://github.com/apache/airflow/blob/master/airflow/contrib/operators/dataproc_operator.py#L1376

    [3] https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/list
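
Both approaches can be sketched roughly as follows. This is a minimal sketch, not the operator's actual code: it assumes the Operation metadata is available as the JSON form of WorkflowMetadata (with graph.nodes entries carrying stepId and jobId), and that the google-cloud-dataproc Python client is installed for the list call. The function names, the sample metadata, and the label key are hypothetical.

```python
from typing import Any, Dict, List, Mapping


def job_ids_from_workflow_metadata(metadata: Mapping[str, Any]) -> Dict[str, str]:
    """Map each workflow step ID to the Dataproc job ID it produced.

    Assumes `metadata` is the JSON form of the Operation's WorkflowMetadata,
    where graph.nodes[] carries stepId and (once submitted) jobId.
    """
    nodes = metadata.get("graph", {}).get("nodes", [])
    # A node has no jobId until its job has actually been submitted.
    return {node["stepId"]: node["jobId"] for node in nodes if node.get("jobId")}


def label_filter(labels: Mapping[str, str]) -> str:
    """Build a jobs.list filter such as 'labels.my-label = 12345'."""
    return " AND ".join(f"labels.{key} = {value}" for key, value in sorted(labels.items()))


def list_job_ids_by_label(project_id: str, region: str, labels: Mapping[str, str]) -> List[str]:
    """Return the IDs of jobs carrying the given labels.

    Assumes the google-cloud-dataproc package is installed and that
    application default credentials are configured.
    """
    from google.cloud import dataproc_v1  # imported lazily: optional dependency

    client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    request = {"project_id": project_id, "region": region, "filter": label_filter(labels)}
    return [job.reference.job_id for job in client.list_jobs(request=request)]


if __name__ == "__main__":
    # Hypothetical metadata, shaped like a finished workflow Operation.
    sample_metadata = {
        "graph": {
            "nodes": [
                {"stepId": "job0", "jobId": "job0-abc2def65gh12", "state": "COMPLETED"},
            ]
        }
    }
    print(job_ids_from_workflow_metadata(sample_metadata))
    print(label_filter({"my-label": "12345"}))
```

In an Airflow DAG, a downstream task could call list_job_ids_by_label with the same label that was set on the workflow template, which avoids having to modify the operator.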