Tags: azure, azure-machine-learning-service, azure-sdk-python

How to access private blob container inside pipeline job - Azure SDKv2?


I am working from this example, taken from the official documentation:

from azure.ai.ml import command, Input, MLClient, UserIdentityConfiguration, ManagedIdentityConfiguration
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.identity import DefaultAzureCredential

# Set your subscription, resource group and workspace name:
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"

# connect to the AzureML workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)

# ==============================================================
# You can set the identity you want to use in a job to access the data. Options include:
# identity = UserIdentityConfiguration() # Use the user's identity
# identity = ManagedIdentityConfiguration() # Use the compute target managed identity
# ==============================================================
# This example accesses public data, so we don't need an identity.
# You can also set identity to None if you use a credential-based datastore.
identity = None

# Set the input for the job (data_type, path and mode must be defined first, e.g.):
data_type = AssetTypes.URI_FILE
path = "<PATH_TO_DATA>"
mode = InputOutputModes.RO_MOUNT
inputs = {
    "input_data": Input(type=data_type, path=path, mode=mode)
}

# This command job uses the head Linux command to print the first 10 lines of the file
job = command(
    command="head ${{inputs.input_data}}",
    inputs=inputs,
    environment="azureml://registries/azureml/environments/sklearn-1.1/versions/4",
    compute="cpu-cluster",
    identity=identity,
)

# Submit the command
ml_client.jobs.create_or_update(job)

The above code is from the Azure documentation.

My case feeds the Input shown below into a pipeline job. However, the blob container it points to is private, so the job cannot access it the way the tutorial above (which uses public data) does.

I understand that I need to assign some permissions. The cluster the code runs on already has a managed identity with the following role assignments:

  1. Owner and Contributor on the storage account that contains the blob container.

  2. Owner and Contributor on the machine learning workspace.
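
To verify what the identity can actually do, the Azure CLI can list its role assignments. A sketch, where `<PRINCIPAL_ID>` is the managed identity's principal (object) ID and the other placeholders follow the naming used above:

```shell
# List the role assignments the managed identity holds on the storage account
az role assignment list \
    --assignee "<PRINCIPAL_ID>" \
    --scope "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT>" \
    --output table
```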

I have a custom input which looks like:

    Input(
        type="uri_folder",
        # Here is the problem: this path cannot be accessed in a job.
        path="wasbs://container_name@storage_account.blob.core.windows.net/",
        mode="ro_mount"
    ),

What should I do in order to access data inside a private container from my job, using an Input like the one above?


Solution

  • The fix was to add the following role to the managed identity:

    Storage Blob Data Contributor
    

    (needed because the job both reads from and writes to that private blob container inside the storage account)
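
Granting that role from the Azure CLI could look like this (a sketch; `<PRINCIPAL_ID>` is the managed identity's principal ID and the other placeholders follow the naming above):

```shell
# Grant the managed identity read/write access to blob data in the account
az role assignment create \
    --assignee "<PRINCIPAL_ID>" \
    --role "Storage Blob Data Contributor" \
    --scope "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT>"
```

Owner and Contributor are management-plane roles only; reading or writing the blobs themselves requires one of the data-plane "Storage Blob Data" roles, which is why this assignment is what makes the mount work.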