Tags: python, databricks, databricks-asset-bundle

How to configure pypi repo authentication for an Azure DevOps Artifact Feed in databricks.yml for Databricks Asset Bundles?


I have a python_wheel_task in one of my asset bundle jobs that executes the whl file built from the local repository I deploy the bundle from. This process works fine on its own.

However, I need to add a custom dependency whl (from another repo, packaged and published to my Azure Artifact Feed) to the task as a library in order for my local repo's whl to work completely.

I tried to define it as follows:

    - task_key: some_task
      job_cluster_key: job_cluster
      python_wheel_task:
        package_name: my_local_package_name
        entry_point: my_entrypoint
        named_parameters: { "env": "dev" }
      libraries:
        - pypi:
            package: custom_package==1.0.1
            repo: https://pkgs.dev.azure.com/<company>/<some-id>/_packaging/<feed-name>/pypi/simple/
        - whl: ../../dist/*.whl  # my local repo's whl: being built as part of the asset-bundle
           

When I deploy and run the bundle, I get the following error in the job cluster:

24/07/12 07:49:01 ERROR Utils: 
Process List(/bin/su, libraries, -c, bash /local_disk0/.ephemeral_nfs/cluster_libraries/python/python_start_clusterwide.sh
/local_disk0/.ephemeral_nfs/cluster_libraries/python/bin/pip install 'custom_package==3.0.1' 
--index-url https://pkgs.dev.azure.com/<company>/<some-id>/_packaging/<feed-name>/pypi/simple/ 
--disable-pip-version-check) exited with code 1, and Looking in indexes:
https://pkgs.dev.azure.com/<company>/<some-id>/_packaging/<feed-name>/pypi/simple/

24/07/12 07:49:01 INFO SharedDriverContext: Failed to attach library 
python-pypi;custom_package;;3.0.1;https://pkgs.dev.azure.com/<company>/<some-id>/_packaging/<feed-name>/pypi/simple/ 
to Spark

I suppose I need to configure a personal access token / authentication for the feed somewhere, but I cannot find anything in the Databricks documentation about library dependencies. There is only one sentence about adding a custom index and nothing about authentication.

How can I get this to work?


Solution

  • Best practice solution for existing all-purpose clusters

    I managed to use a combination of an existing cluster, a cluster environment variable and an init script to configure the cluster for authentication against a custom PyPI index:

    1. I stored an Azure DevOps PAT in my Key Vault.

    2. I created a secret scope in Databricks backed by that Key Vault (see the sketch after these steps).

    3. I uploaded/imported the init script in Databricks to Workspace/Shared/init-scripts/set-private-artifact-feed.sh.

    4. I created an all-purpose cluster and set the following under Configuration -> Advanced options:

       • Environment variable: PYPI_TOKEN={{secrets/<my-scope>/<secret-name-of-devops-pat>}}

       • Init Scripts: Type Workspace, File path /Shared/init-scripts/set-private-artifact-feed.sh
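
    For steps 1 and 2, a Key Vault-backed secret scope can also be created through the Databricks Secrets API instead of the UI. A minimal sketch, assuming placeholder values for the workspace URL, scope name, Key Vault resource ID and DNS name, and an Azure AD access token in $AAD_TOKEN (a regular Databricks PAT is not sufficient for creating Key Vault-backed scopes):

    # Hypothetical placeholders: substitute your workspace URL, scope name and Key Vault details.
    curl -X POST "https://<databricks-instance>/api/2.0/secrets/scopes/create" \
      -H "Authorization: Bearer $AAD_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
            "scope": "<my-scope>",
            "scope_backend_type": "AZURE_KEYVAULT",
            "backend_azure_keyvault": {
              "resource_id": "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.KeyVault/vaults/<vault-name>",
              "dns_name": "https://<vault-name>.vault.azure.net/"
            }
          }'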

    Contents of set-private-artifact-feed.sh:

    #!/bin/bash
    # Abort if the secret-backed environment variable was not injected into the cluster
    if [[ -z "$PYPI_TOKEN" ]]; then
      echo "PYPI_TOKEN is not set" >&2
      exit 1
    fi
    # Write a cluster-wide pip config that adds the authenticated Azure Artifact Feed as an extra index
    printf "[global]\n" > /etc/pip.conf
    printf "extra-index-url =\n" >> /etc/pip.conf
    printf "\thttps://$PYPI_TOKEN@pkgs.dev.azure.com/<company>/<some-id>/_packaging/<feed-name>/pypi/simple/\n" >> /etc/pip.conf

    After restarting the cluster, I could run my task as I initially defined it; the authentication against the index now works. More details are in this Medium article.
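
    To actually run the bundle task on that pre-configured all-purpose cluster, the task can point at it by ID instead of using a job cluster. A minimal sketch, assuming a placeholder cluster ID and otherwise reusing the task definition from the question:

    - task_key: some_task
      existing_cluster_id: <all-purpose-cluster-id>  # instead of job_cluster_key: job_cluster
      python_wheel_task:
        package_name: my_local_package_name
        entry_point: my_entrypoint
        named_parameters: { "env": "dev" }
      libraries:
        - pypi:
            package: custom_package==1.0.1
            repo: https://pkgs.dev.azure.com/<company>/<some-id>/_packaging/<feed-name>/pypi/simple/
        - whl: ../../dist/*.whl

    The pypi repo entry can stay as it was originally defined: the init script adds the authenticated feed as an extra-index-url in /etc/pip.conf, so pip can still resolve the package even though the URL in the library spec carries no credentials.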

    Note that this does not work with a job cluster unless you also pass along the reference to the init script and set the environment variable on the job cluster! Using an all-purpose cluster makes more sense to me.
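
    If you do want to stay on a job cluster, the same two pieces of configuration can in principle be declared on the job cluster definition in databricks.yml. A minimal sketch, assuming the secret scope and init script path from above; spark_version and node_type_id are example values only:

    job_clusters:
      - job_cluster_key: job_cluster
        new_cluster:
          spark_version: 15.4.x-scala2.12   # example value
          node_type_id: Standard_DS3_v2     # example value
          num_workers: 1
          spark_env_vars:
            PYPI_TOKEN: "{{secrets/<my-scope>/<secret-name-of-devops-pat>}}"
          init_scripts:
            - workspace:
                destination: /Shared/init-scripts/set-private-artifact-feed.sh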

  • Hacky solution for job clusters

    We can add the PyPI token directly to the repo URL. I was unable to set the init scripts / environment variables properly for the job clusters to get it to work otherwise.

    - pypi:
        package: pyspark-framework==4.0.0
        repo: https://<YOUR-TOKEN-HERE>@pkgs.dev.azure.com/<company>/<some-id>/_packaging/<feed-name>/pypi/simple/
    - whl: ../dist/*.whl

    This hacky solution is a major security risk: the token will show up in plain text in your Databricks workspace for anyone to see!
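
    If you go this route, at least keep the token out of version control by substituting it at deploy time through a bundle variable. A minimal sketch, assuming a hypothetical pypi_token variable supplied via --var="pypi_token=..." or the BUNDLE_VAR_pypi_token environment variable; note that the resolved token still ends up in plain text in the deployed job specification:

    # databricks.yml, top level
    variables:
      pypi_token:
        description: Azure DevOps PAT for the artifact feed

    # in the task definition
    libraries:
      - pypi:
          package: pyspark-framework==4.0.0
          repo: https://${var.pypi_token}@pkgs.dev.azure.com/<company>/<some-id>/_packaging/<feed-name>/pypi/simple/
      - whl: ../dist/*.whl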