I am deploying a Databricks asset bundle via an Azure DevOps pipeline using databricks bundle deploy -t <target_env> from the databricks-cli. Deployment works fine, but when the job runs it fails to authenticate to fetch code from Azure Git. It appears there is no way to set Microsoft Entra ID (formerly Azure Active Directory) authentication when the notebook_task has its source set to GIT (I am using a Git reference for the code).
I tried setting git_provider to azureDevOpsServicesAad, but the only valid value for Azure DevOps is azureDevOpsServices.
The job is configured to run as a service principal, which has the necessary read/write permissions on the Azure Git Repo.
I understand that I can use the Databricks REST API to create a new Git credential after obtaining an access token for the job's service principal, but after looking at the YAML schema I see no way to specify a Git credential ID in the asset bundle YAML. Is there really no git_credential_id property for the git_source section?
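For anyone who wants to try the REST route mentioned above, here is a minimal sketch in Python of what the credential-creation call might look like. It only builds the request and does not send it. The endpoint path /api/2.0/git-credentials is the Databricks Git Credentials API; the function name, the workspace URL, and the default payload (azureDevOpsServicesAad with no personal access token) are my assumptions for illustration, so check them against the API reference before relying on them.

```python
import json
import urllib.request
from typing import Optional


def build_git_credential_request(
    workspace_url: str,
    sp_token: str,
    payload: Optional[dict] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST that registers a Git credential.

    sp_token is assumed to be a Microsoft Entra ID access token obtained
    for the service principal the job runs as.
    """
    # Assumption: the AAD-based Azure DevOps provider needs no PAT in the body.
    body = payload if payload is not None else {"git_provider": "azureDevOpsServicesAad"}
    return urllib.request.Request(
        url=f"{workspace_url.rstrip('/')}/api/2.0/git-credentials",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {sp_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it. Because the call is authenticated as the service principal, the credential would belong to that principal, which is the identity the job uses when it checks out the repository.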
Below is my yaml configuration for the databricks asset bundle job:
    resources:
      jobs:
        my_job_name:
          name: my_job_name
          run_as:
            service_principal_name: ${var.run_as_user}
          permissions:
            - group_name: ${var.workspace_user_group}
              level: CAN_MANAGE_RUN
            - group_name: ${var.workspace_admin_group}
              level: CAN_MANAGE
          deployment_config:
            no_package: true
          git_source:
            git_provider: azureDevOpsServices
            git_url: https://dev.azure.com/<org-name>/<proj-name>/_git/${var.source_repo_name}
            git_branch: ${var.source_branch_name}
          max_concurrent_runs: 10
          tasks:
            - task_key: Run_All_Notebooks
              notebook_task:
                notebook_path: ${bundle.name}/_run_all_notebooks
                source: GIT
              existing_cluster_id: ${var.shared_autoscaling_cluster_id}
After deployment when I run the job in the Databricks UI, the error reported is:
Task Run_All_Notebooks failed with message: Failed to checkout Git repository: PERMISSION_DENIED: Encountered an error with your Azure Active Directory credentials. Please try logging out of Azure Active Directory (https://portal.azure.com) and logging back in. This caused all downstream tasks to get skipped.
Is there any way to explicitly set AAD authentication for the git_provider, or to explicitly specify Git credentials for the service principal under which the job is set to run?
I ended up using WORKSPACE as the value for source in the notebook_task.
It was possible to use GIT, but I didn't want to store secrets in the Databricks workspace (for more details refer to this post here). Below is a high-level overview of my process:
bundle deploy ... where the source in the YAML is set to WORKSPACE, pointing to the code in the Databricks repo that was updated in the previous step.
I hope this is helpful to other lost souls looking for a solution to the problem :)
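In case it helps to see the last two steps as code, here is a rough Python sketch. It assumes the "previous step" is a call to the Databricks Repos API (PATCH /api/2.0/repos/{repo_id}) that moves the workspace repo to the wanted branch, followed by the bundle deploy command from the question. The function names, repo_id, and the token are hypothetical placeholders, not what my pipeline literally runs.

```python
import json
import urllib.request
from typing import List


def build_repo_update_request(
    workspace_url: str, token: str, repo_id: int, branch: str
) -> urllib.request.Request:
    """Build (but do not send) a PATCH that checks out `branch` in a workspace repo."""
    return urllib.request.Request(
        url=f"{workspace_url.rstrip('/')}/api/2.0/repos/{repo_id}",
        data=json.dumps({"branch": branch}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )


def build_deploy_command(target_env: str) -> List[str]:
    """Return the argv for the deploy step; run it with subprocess.run(..., check=True)."""
    return ["databricks", "bundle", "deploy", "-t", target_env]
```

The order matters: the repo has to be updated first so that the WORKSPACE paths in the deployed job point at the current code.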