
Databricks Asset Bundle - Git authentication when a job's notebook_task has source set to GIT


I am deploying a Databricks asset bundle from an Azure DevOps pipeline using databricks bundle deploy -t <target_env> from the Databricks CLI. Deployment works without issues; however, when the job runs, it fails to authenticate when fetching code from Azure Git. There appears to be no way to set Microsoft Entra ID (formerly Azure Active Directory) authentication when the notebook_task has its source set to GIT. I am using a Git reference for the code.

I tried setting git_provider to azureDevOpsServicesAad, but the only valid value for Azure DevOps is azureDevOpsServices.

The job is configured to run as a service principal, which has the necessary read/write permissions on the Azure Git Repo.

I understand that I can use the Databricks REST API to create a new Git credential after obtaining an access token for the job's service principal, but after looking at the YAML schema I see no way to specify a Git credential ID in the asset bundle YAML. Is there no git_credential_id property for the git_source section?
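
For context, the credential workaround mentioned above could be scripted as a pipeline step roughly like the following. This is only a sketch under assumptions, not a tested setup: the service connection name, the workspace URL and the Git username are placeholders, and it assumes the Azure CLI login belongs to the same service principal the job runs as, that this principal is a member of the workspace, and that the Databricks CLI is installed on the agent. It uses the Databricks CLI wrapper around the Git credentials REST API rather than calling the endpoint directly.

```yaml
# Hypothetical Azure DevOps pipeline step; every value in <...> is a placeholder.
steps:
  - task: AzureCLI@2
    displayName: Register Git credential for the job service principal
    inputs:
      azureSubscription: <service-connection-of-the-job-service-principal>
      scriptType: bash
      scriptLocation: inlineScript
      inlineScript: |
        set -euo pipefail
        # Request an Entra ID access token for Azure DevOps
        # (the GUID is the Azure DevOps resource ID).
        ADO_TOKEN=$(az account get-access-token \
          --resource 499b84ac-1321-427f-aa17-267ca6975798 \
          --query accessToken -o tsv)
        # Store the token as the Git credential of the identity the CLI is
        # authenticated as (assumed here: the job's service principal).
        # This may fail if a credential already exists for that identity.
        databricks git-credentials create azureDevOpsServices \
          --git-username "<service-principal-name>" \
          --personal-access-token "$ADO_TOKEN"
    env:
      DATABRICKS_HOST: <workspace-url>
```

Two caveats with this approach: the Entra ID token is short-lived, so the credential would have to be refreshed before job runs, and the token ends up stored in the Databricks workspace.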

Below is my yaml configuration for the databricks asset bundle job:

resources:
  jobs:
    my_job_name:
      name: my_job_name

      run_as:
        service_principal_name: ${var.run_as_user}

      permissions:
        - group_name: ${var.workspace_user_group}
          level: CAN_MANAGE_RUN
        - group_name: ${var.workspace_admin_group}
          level: CAN_MANAGE

      deployment_config:
        no_package: true

      git_source:
        git_provider: azureDevOpsServices
        git_url: https://dev.azure.com/<org-name>/<proj-name>/_git/${var.source_repo_name}
        git_branch: ${var.source_branch_name}

      max_concurrent_runs: 10
      tasks:
        - task_key: Run_All_Notebooks
          notebook_task:
            notebook_path: ${bundle.name}/_run_all_notebooks
            source: GIT
          existing_cluster_id: ${var.shared_autoscaling_cluster_id}

After deployment, when I run the job in the Databricks UI, the error reported is:

Task Run_All_Notebooks failed with message: Failed to checkout Git repository: PERMISSION_DENIED: Encountered an error with your Azure Active Directory credentials. Please try logging out of Azure Active Directory (https://portal.azure.com) and logging back in. This caused all downstream tasks to get skipped.

Is there any way to explicitly set AAD authentication for the git_provider or explicitly specify git credentials for the service principal under which the job is set to run?


Solution

  • I ended up using WORKSPACE as the value for source in the notebook_task. It was possible to use GIT, but I didn't want to store secrets in the Databricks workspace (for more details refer to this post here). Below is a high-level overview of my process:

    I hope this is helpful to other lost souls looking for a solution to this problem :)
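
As an illustration of the WORKSPACE approach, the job definition might change roughly as shown below. This is a minimal sketch, not the exact configuration used: it assumes the notebooks are deployed together with the bundle files, so that _run_all_notebooks ends up directly under ${workspace.file_path}, and that the run-as service principal can read that workspace folder. Permissions and other settings from the original job are omitted for brevity.

```yaml
# Sketch only: the same job with the notebook read from the workspace instead of Git.
resources:
  jobs:
    my_job_name:
      name: my_job_name

      run_as:
        service_principal_name: ${var.run_as_user}

      # No git_source block: nothing is checked out from Azure DevOps at run
      # time, so the service principal needs no Git credential.

      max_concurrent_runs: 10
      tasks:
        - task_key: Run_All_Notebooks
          notebook_task:
            # Assumption: the notebook is uploaded by "databricks bundle deploy"
            # to the bundle's file path in the workspace.
            notebook_path: ${workspace.file_path}/_run_all_notebooks
            source: WORKSPACE
          existing_cluster_id: ${var.shared_autoscaling_cluster_id}
```

With this layout, the code that runs is whatever the last deployment uploaded, so picking up changes from the repository requires another databricks bundle deploy.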