Tags: python, azure, azure-machine-learning-service, kubernetes-deployment, azureml-python-sdk

Azure ML: DeploymentIdentityError: Failed to create Kubernetes deployment identity, Reason:RefreshExtensionIdentityNotSet


I'm using the Azure Machine Learning v2 SDK to create a model deployment on a Kubernetes compute attached to an AML workspace. Deploying locally, as a test before going online, works. However, when I tried to deploy online using KubernetesOnlineDeployment, I received DeploymentIdentityError: Failed to create Kubernetes deployment identity, Reason:RefreshExtensionIdentityNotSet. (The full error is below.)

I provisioned the AKS cluster using Terraform.

I also followed this official tutorial notebook. I tried the local deployment flow described in the tutorial and it works fine.

In section 4.3 (Attach Arc Cluster) of the tutorial, I modified compute_params to include an identity entry as well.

Below is the code I used to attach the cluster:

compute = "testfooamlXXXX-c"

from azure.ai.ml import load_compute

compute_params = [
    {"name": compute},
    {"type": "kubernetes"},
    {
        "resource_id": "/subscriptions/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/resourceGroups/test-foo-aml/providers/Microsoft.ContainerService/managedClusters/testfooamlXXXX",
    },
    {"identity": {"type":"SystemAssigned"}},  # This is the line I added
]
k8s_compute = load_compute(source=None, params_override=compute_params)

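Below is how I create the endpoint first:

from azure.ai.ml.entities import (
    CodeConfiguration,
    Environment,
    KubernetesOnlineDeployment,
    KubernetesOnlineEndpoint,
    Model,
)

endpoint = KubernetesOnlineEndpoint(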
    name=endpoint_name,
    compute=compute,
    description="this is a sample online endpoint",
    auth_mode="key",
    tags={"foo": "bar"},
)
ml_client.begin_create_or_update(endpoint).result()

Then I created the deployment object:

blue_deployment = KubernetesOnlineDeployment(
    name="blue",
    endpoint_name=endpoint_name,
    model=Model(path=str(model_path)),
    environment=Environment(name=env_name, version=env_version),
    code_configuration=CodeConfiguration(
        code=str(model_script_path.parent), scoring_script=model_script_path.name
    ),
    instance_count=1,
)

Finally, the line below is the one that raises the error:

ml_client.begin_create_or_update(blue_deployment).result()

The Error:

---------------------------------------------------------------------------
OperationFailed                           Traceback (most recent call last)
File ~/miniconda3/envs/rishabh/lib/python3.11/site-packages/azure/core/polling/base_polling.py:757, in LROBasePolling.run(self)
    756 try:
--> 757     self._poll()
    759 except BadStatus as err:

File ~/miniconda3/envs/rishabh/lib/python3.11/site-packages/azure/core/polling/base_polling.py:789, in LROBasePolling._poll(self)
    788 if _failed(self.status()):
--> 789     raise OperationFailed("Operation failed or canceled")
    791 final_get_url = self._operation.get_final_get_url(self._pipeline_response)

OperationFailed: Operation failed or canceled

The above exception was the direct cause of the following exception:

HttpResponseError                         Traceback (most recent call last)
/home/rishabh/aml/test-foo-model-deployments.ipynb Cell 51 line 1
----> 1 ml_client.begin_create_or_update(blue_deployment).result()

File ~/miniconda3/envs/rishabh/lib/python3.11/site-packages/azure/core/polling/_poller.py:251, in LROPoller.result(self, timeout)
    242 def result(self, timeout: Optional[float] = None) -> PollingReturnType_co:
    243     """Return the result of the long running operation, or
    244     the result available after the specified timeout.
    245 
   (...)
    249     :raises ~azure.core.exceptions.HttpResponseError: Server problem with the query.
    250     """
--> 251     self.wait(timeout)
    252     return self._polling_method.resource()

File ~/miniconda3/envs/rishabh/lib/python3.11/site-packages/azure/core/tracing/decorator.py:78, in distributed_trace.<locals>.decorator.<locals>.wrapper_use_tracer(*args, **kwargs)
     76 span_impl_type = settings.tracing_implementation()
     77 if span_impl_type is None:
---> 78     return func(*args, **kwargs)
     80 # Merge span is parameter is set, but only if no explicit parent are passed
     81 if merge_span and not passed_in_parent:

File ~/miniconda3/envs/rishabh/lib/python3.11/site-packages/azure/core/polling/_poller.py:270, in LROPoller.wait(self, timeout)
    266 self._thread.join(timeout=timeout)
    267 try:
    268     # Let's handle possible None in forgiveness here
    269     # https://github.com/python/mypy/issues/8165
--> 270     raise self._exception  # type: ignore
    271 except TypeError:  # Was None
    272     pass

File ~/miniconda3/envs/rishabh/lib/python3.11/site-packages/azure/core/polling/_poller.py:185, in LROPoller._start(self)
    181 """Start the long running operation.
    182 On completion, runs any callbacks.
    183 """
    184 try:
--> 185     self._polling_method.run()
    186 except AzureError as error:
    187     if not error.continuation_token:

File ~/miniconda3/envs/rishabh/lib/python3.11/site-packages/azure/core/polling/base_polling.py:772, in LROBasePolling.run(self)
    765     raise HttpResponseError(
    766         response=self._pipeline_response.http_response,
    767         message=str(err),
    768         error=err,
    769     ) from err
    771 except OperationFailed as err:
--> 772     raise HttpResponseError(response=self._pipeline_response.http_response, error=err) from err

HttpResponseError: (None) DeploymentIdentityError: Failed to create Kubernetes deployment identity, Reason:RefreshExtensionIdentityNotSet Details:Managed identity of AzureML extension is not assigned to the node pool of 'aks-default-XXXXXXXX-vmss000000'. The identity is used to give access for user container, such as pull image from ACR. Please see troubleshooting guide, available here: https://aka.ms/amlarc-tsg
Code: None
Message: DeploymentIdentityError: Failed to create Kubernetes deployment identity, Reason:RefreshExtensionIdentityNotSet Details:Managed identity of AzureML extension is not assigned to the node pool of 'aks-default-XXXXXXXX-vmss000000'. The identity is used to give access for user container, such as pull image from ACR. Please see troubleshooting guide, available here: https://aka.ms/amlarc-tsg
Exception Details:  (None) DeploymentIdentityError: Failed to create Kubernetes deployment identity, Reason:RefreshExtensionIdentityNotSet Details:Managed identity of AzureML extension is not assigned to the node pool of 'aks-default-XXXXXXXX-vmss000000'. The identity is used to give access for user container, such as pull image from ACR. Please see troubleshooting guide, available here: https://aka.ms/amlarc-tsg
    Code: None
    Message: DeploymentIdentityError: Failed to create Kubernetes deployment identity, Reason:RefreshExtensionIdentityNotSet Details:Managed identity of AzureML extension is not assigned to the node pool of 'aks-default-XXXXXXXX-vmss000000'. The identity is used to give access for user container, such as pull image from ACR. Please see troubleshooting guide, available here: https://aka.ms/amlarc-tsg
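For triage, the reason code and the affected node pool can be pulled out of the message text above. The following is only a sketch: the regular expressions are inferred from this one message, not from any documented format, and parse_deployment_error is a hypothetical helper name, not part of the Azure SDK.

```python
import re

# Patterns inferred from the single error message shown above (an assumption,
# not a documented contract of the Azure ML service).
_ERROR_RE = re.compile(r"(?P<error>\w+Error): .*?Reason:\s*(?P<reason>\w+)", re.DOTALL)
_NODE_POOL_RE = re.compile(r"node pool of '(?P<node_pool>[^']+)'")


def parse_deployment_error(message: str) -> dict:
    """Return the error type, reason code and node pool found in the message.

    Fields that cannot be found are simply left out of the result.
    """
    result = {}
    error_match = _ERROR_RE.search(message)
    if error_match:
        result.update(error_match.groupdict())
    pool_match = _NODE_POOL_RE.search(message)
    if pool_match:
        result.update(pool_match.groupdict())
    return result
```

In the notebook this could be called as parse_deployment_error(str(err)) inside an except block around the begin_create_or_update(...).result() call.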

Regarding the error, quoting the official documentation:

ERROR: RefreshExtensionIdentityNotSet

This error occurs when the extension is installed but the extension identity is not correctly assigned. You can try to reinstall the extension to fix it.

I tried re-installing the extension and deploying but got the same error.


Solution

  • It seems the Azure ML extension's deployment identity-controller was being interfered with by aad-pod-identity. Removing aad-pod-identity resolved the issue.
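
To check whether a cluster is in the same situation, one option is to look for aad-pod-identity pods in the output of kubectl get pods --all-namespaces -o json. The sketch below assumes the pods have "aad-pod-identity" in their names, which depends on how it was installed (some installs may use other names, so adjust the hint as needed). Both function names are hypothetical helpers.

```python
import json
import subprocess

# Assumption: aad-pod-identity pods contain this substring in their name.
# Adjust if your install uses different pod names.
NAME_HINT = "aad-pod-identity"


def find_pods(pod_list: dict, hint: str = NAME_HINT) -> list[tuple[str, str]]:
    """Return (namespace, name) for every pod whose name contains the hint."""
    found = []
    for item in pod_list.get("items", []):
        meta = item.get("metadata", {})
        name = meta.get("name", "")
        if hint in name:
            found.append((meta.get("namespace", ""), name))
    return found


def load_pods_from_cluster() -> dict:
    """Fetch all pods as JSON using the current kubectl context."""
    completed = subprocess.run(
        ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout)
```

With kubectl pointed at the AKS cluster, find_pods(load_pods_from_cluster()) lists any matching pods. An empty list only means nothing matched the name hint, not that aad-pod-identity is definitely absent.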