
Azure Databricks CLI: update workflow/job definition


I have created a pipeline in Azure DevOps to perform the following three steps:

  1. Retrieve the job definition from one Databricks workspace and save it as JSON (Databricks CLI configuration is omitted)

    databricks jobs get --job-id $(job_id) > workflow.json

  2. Use this JSON file to update the workflow in a second (separate) Databricks workspace (the Databricks CLI is first reconfigured to point to the new workspace)

    databricks jobs reset --job-id $(job_id) --json-file workflow.json

  3. Run the updated job in the second Databricks workspace

    databricks jobs run-now --job-id $(job_id)

However, the pipeline fails at step 2 with the following error, even though existing_cluster_id is defined in workflow.json. Any idea why?

Error: b'{"error_code":"INVALID_PARAMETER_VALUE","message":"One of job_cluster_key, new_cluster, or existing_cluster_id must be specified."}' 

Here is what my workflow.json looks like (hiding some of the details):

    {
        "job_id": 123,
        "creator_user_name": "user1",
        "run_as_user_name": "user1",
        "run_as_owner": true,
        "settings": {
            "name": "my-workflow",
            "existing_cluster_id": "abc-def-123-xyz",
            "email_notifications": {
                "no_alert_for_skipped_runs": false
            },
            "webhook_notifications": {},
            "timeout_seconds": 0,
            "notebook_task": {
                "notebook_path": "notebooks/my-notebook",
                "base_parameters": {
                    "environment": "production"
                },
                "source": "GIT"
            },
            "max_concurrent_runs": 1,
            "git_source": {
                "git_url": "https://my-org@dev.azure.com/my-project/_git/my-repo",
                "git_provider": "azureDevOpsServices",
                "git_branch": "master"
            },
            "format": "SINGLE_TASK"
        },
        "created_time": 1676477563075
    }

Solution

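  • I figured out that the reset command in step 2 does not need the entire job definition retrieved in step 1, only its "settings" part. In the full output, existing_cluster_id is nested under "settings", so presumably it is not found at the top level where the reset command looks for it. Changing step 1 to the following solved my issue: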

    databricks jobs get --job-id $(job_id) | jq .settings > workflow.json
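To make this step fail early in the pipeline rather than at the reset call, the extraction can be wrapped in two small helpers. This is only a sketch, assuming jq is available on the build agent. The function names extract_settings and has_cluster are made up for illustration, and the cluster check only makes sense for single-task definitions like the one in the question, where the cluster is named directly inside "settings".

```shell
#!/usr/bin/env bash
set -euo pipefail

# Read the output of "databricks jobs get" on stdin and print only the
# "settings" object. jq -e exits non-zero when the result is null, so a
# missing "settings" key fails the step instead of writing "null" to the file.
extract_settings() {
  jq -e '.settings'
}

# Succeed only if the file names a cluster at its top level, which is what
# the error message in the question complains about.
# Assumption: single-task format, as in the question.
has_cluster() {
  jq -e 'has("existing_cluster_id") or has("new_cluster") or has("job_cluster_key")' "$1" > /dev/null
}

# Possible use inside the pipeline script (same commands as in the steps above):
#   databricks jobs get --job-id $(job_id) | extract_settings > workflow.json
#   has_cluster workflow.json
#   databricks jobs reset --job-id $(job_id) --json-file workflow.json
```

With the unmodified output of step 1, has_cluster returns a non-zero status because the cluster key sits one level down, so the pipeline stops before calling reset.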