I'm using the Azure Machine Learning v2 SDK to create a model deployment on a Kubernetes compute attached to an AML workspace. I'm able to deploy it locally as part of testing before deploying online. However, when I tried to deploy online using KubernetesOnlineDeployment, I received DeploymentIdentityError: Failed to create Kubernetes deployment identity, Reason:RefreshExtensionIdentityNotSet. (More detailed error below.)
I provisioned the AKS cluster using Terraform.
I also referred to this official tutorial notebook. I've tried the local deployment flow mentioned in the tutorial and it works fine.
In section 4.3 of the tutorial, Attach Arc Cluster, I modified the compute_params dict to include identity as well.
Below is the code I used to attach the cluster:
from azure.ai.ml import load_compute

compute = "testfooamlXXXX-c"
compute_params = [
    {"name": compute},
    {"type": "kubernetes"},
    {
        "resource_id": "/subscriptions/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/resourceGroups/test-foo-aml/providers/Microsoft.ContainerService/managedClusters/testfooamlXXXX",
    },
    {"identity": {"type": "SystemAssigned"}},  # This is the line I added
]
k8s_compute = load_compute(source=None, params_override=compute_params)
Below is how I'm creating an endpoint first:
from azure.ai.ml.entities import KubernetesOnlineEndpoint

endpoint = KubernetesOnlineEndpoint(
    name=endpoint_name,
    compute=compute,
    description="this is a sample online endpoint",
    auth_mode="key",
    tags={"foo": "bar"},
)
ml_client.begin_create_or_update(endpoint).result()
Then I created the deployment object:
from azure.ai.ml.entities import (
    CodeConfiguration,
    Environment,
    KubernetesOnlineDeployment,
    Model,
)

blue_deployment = KubernetesOnlineDeployment(
    name="blue",
    endpoint_name=endpoint_name,
    model=Model(path=str(model_path)),
    environment=Environment(name=env_name, version=env_version),
    code_configuration=CodeConfiguration(
        code=str(model_script_path.parent), scoring_script=model_script_path.name
    ),
    instance_count=1,
)
Finally, the line below is the one that causes the issue:
ml_client.begin_create_or_update(blue_deployment).result()
The Error:
---------------------------------------------------------------------------
OperationFailed Traceback (most recent call last)
File ~/miniconda3/envs/rishabh/lib/python3.11/site-packages/azure/core/polling/base_polling.py:757, in LROBasePolling.run(self)
756 try:
--> 757 self._poll()
759 except BadStatus as err:
File ~/miniconda3/envs/rishabh/lib/python3.11/site-packages/azure/core/polling/base_polling.py:789, in LROBasePolling._poll(self)
788 if _failed(self.status()):
--> 789 raise OperationFailed("Operation failed or canceled")
791 final_get_url = self._operation.get_final_get_url(self._pipeline_response)
OperationFailed: Operation failed or canceled
The above exception was the direct cause of the following exception:
HttpResponseError Traceback (most recent call last)
/home/rishabh/aml/test-foo-model-deployments.ipynb Cell 51 line 1
----> 1 ml_client.begin_create_or_update(blue_deployment).result()
File ~/miniconda3/envs/rishabh/lib/python3.11/site-packages/azure/core/polling/_poller.py:251, in LROPoller.result(self, timeout)
242 def result(self, timeout: Optional[float] = None) -> PollingReturnType_co:
243 """Return the result of the long running operation, or
244 the result available after the specified timeout.
245
(...)
249 :raises ~azure.core.exceptions.HttpResponseError: Server problem with the query.
250 """
--> 251 self.wait(timeout)
252 return self._polling_method.resource()
File ~/miniconda3/envs/rishabh/lib/python3.11/site-packages/azure/core/tracing/decorator.py:78, in distributed_trace.<locals>.decorator.<locals>.wrapper_use_tracer(*args, **kwargs)
76 span_impl_type = settings.tracing_implementation()
77 if span_impl_type is None:
---> 78 return func(*args, **kwargs)
80 # Merge span is parameter is set, but only if no explicit parent are passed
81 if merge_span and not passed_in_parent:
File ~/miniconda3/envs/rishabh/lib/python3.11/site-packages/azure/core/polling/_poller.py:270, in LROPoller.wait(self, timeout)
266 self._thread.join(timeout=timeout)
267 try:
268 # Let's handle possible None in forgiveness here
269 # https://github.com/python/mypy/issues/8165
--> 270 raise self._exception # type: ignore
271 except TypeError: # Was None
272 pass
File ~/miniconda3/envs/rishabh/lib/python3.11/site-packages/azure/core/polling/_poller.py:185, in LROPoller._start(self)
181 """Start the long running operation.
182 On completion, runs any callbacks.
183 """
184 try:
--> 185 self._polling_method.run()
186 except AzureError as error:
187 if not error.continuation_token:
File ~/miniconda3/envs/rishabh/lib/python3.11/site-packages/azure/core/polling/base_polling.py:772, in LROBasePolling.run(self)
765 raise HttpResponseError(
766 response=self._pipeline_response.http_response,
767 message=str(err),
768 error=err,
769 ) from err
771 except OperationFailed as err:
--> 772 raise HttpResponseError(response=self._pipeline_response.http_response, error=err) from err
HttpResponseError: (None) DeploymentIdentityError: Failed to create Kubernetes deployment identity, Reason:RefreshExtensionIdentityNotSet Details:Managed identity of AzureML extension is not assigned to the node pool of 'aks-default-XXXXXXXX-vmss000000'. The identity is used to give access for user container, such as pull image from ACR. Please see troubleshooting guide, available here: https://aka.ms/amlarc-tsg
Code: None
Message: DeploymentIdentityError: Failed to create Kubernetes deployment identity, Reason:RefreshExtensionIdentityNotSet Details:Managed identity of AzureML extension is not assigned to the node pool of 'aks-default-XXXXXXXX-vmss000000'. The identity is used to give access for user container, such as pull image from ACR. Please see troubleshooting guide, available here: https://aka.ms/amlarc-tsg
Exception Details: (None) DeploymentIdentityError: Failed to create Kubernetes deployment identity, Reason:RefreshExtensionIdentityNotSet Details:Managed identity of AzureML extension is not assigned to the node pool of 'aks-default-XXXXXXXX-vmss000000'. The identity is used to give access for user container, such as pull image from ACR. Please see troubleshooting guide, available here: https://aka.ms/amlarc-tsg
Code: None
Message: DeploymentIdentityError: Failed to create Kubernetes deployment identity, Reason:RefreshExtensionIdentityNotSet Details:Managed identity of AzureML extension is not assigned to the node pool of 'aks-default-XXXXXXXX-vmss000000'. The identity is used to give access for user container, such as pull image from ACR. Please see troubleshooting guide, available here: https://aka.ms/amlarc-tsg
Regarding the error, quoting the official documentation:
ERROR: RefreshExtensionIdentityNotSet This error occurs when the extension is installed but the extension identity is not correctly assigned. You can try to reinstall the extension to fix it.
I tried re-installing the extension and deploying but got the same error.
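Since the traceback wraps the real cause in a generic HttpResponseError, it can help to pull the reason code out of the message text when distinguishing this failure from other deployment errors. The following is only a minimal sketch: extract_reason is a hypothetical helper name, and it assumes the message keeps the "Reason:<code>" shape seen in the error above.

```python
import re
from typing import Optional

# Assumption: the service keeps emitting "Reason:<code>" as in the error above.
_REASON_RE = re.compile(r"Reason:\s*(\w+)")


def extract_reason(message: str) -> Optional[str]:
    """Return the reason code from a deployment error message, or None if absent."""
    match = _REASON_RE.search(message)
    return match.group(1) if match else None


# Example usage around the failing call (not executed here):
#   from azure.core.exceptions import HttpResponseError
#   try:
#       ml_client.begin_create_or_update(blue_deployment).result()
#   except HttpResponseError as err:
#       print(extract_reason(str(err)))
```

For the message shown in the traceback, this would give RefreshExtensionIdentityNotSet, which is the code to look up in the troubleshooting guide.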
It seems that aad-pod-identity was interfering with the Azure ML extension's deployment identity-controller. Removing aad-pod-identity from the cluster resolved the issue.
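To check whether aad-pod-identity is running in a cluster before removing it, one option is to list all pods and filter by name. This is a rough sketch rather than a definitive check: find_pods_by_name and list_all_pods are hypothetical helper names, and matching on the substring "aad-pod-identity" is an assumption, since the actual pod names depend on how it was installed.

```python
import json
import subprocess
from typing import List, Tuple


def find_pods_by_name(pod_list: dict, needle: str) -> List[Tuple[str, str]]:
    """Return (namespace, name) for every pod whose name contains `needle`."""
    matches = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        name = meta.get("name", "")
        if needle in name:
            matches.append((meta.get("namespace", ""), name))
    return matches


def list_all_pods() -> dict:
    """Fetch every pod in the cluster as parsed JSON (requires kubectl access)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Example usage (needs kubectl configured for the AKS cluster):
#   # Assumption: the pods carry "aad-pod-identity" in their names.
#   for namespace, name in find_pods_by_name(list_all_pods(), "aad-pod-identity"):
#       print(f"{namespace}/{name}")
```

If nothing is printed, the pods may simply be named differently, so it is worth also checking the installed Helm releases or add-ons for the cluster.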