azure azure-machine-learning-service azure-rbac

User XXX does not have access to compute instance YYY. Azure Machine Learning


I have some Terraform in which the service principal is Owner of the subscription, and it can create a compute instance on AML. I assign a user, and the user can connect to it.

But when I create a compute instance myself in the UI, with exactly the same settings as in my Terraform configuration, I fail to connect to it: "User XXX does not have access to compute instance YYY".

Here is the configuration: (screenshots config_0 and config_1)

I have no access to the terminal, Jupyter, or VS Code.

I have no idea why it does not work. When another user creates a compute instance and assigns it to me, it does not work either. But with the service principal, whose only role is Owner of the subscription, the assignment works.

Here are my RBAC roles (on the resource group, not on the subscription nor on the Azure Machine Learning workspace resource):

The custom role AZURE-Datascience-AML-dev-Contributor contains this:

{
    "id": "/subscriptions/xxxxxxxxxxxx/providers/Microsoft.Authorization/roleDefinitions/xxxxxxxxx",
    "properties": {
        "roleName": "AZURE-Datascience-AML-dev-Contributor",
        "description": "This role is used for AML",
        "assignableScopes": [
            "/subscriptions/xxxxxx/resourceGroups/rg-xxxxxxx-01"
        ],
        "permissions": [
            {
                "actions": [
                    "Microsoft.MachineLearningServices/workspaces/*/read",
                    "Microsoft.MachineLearningServices/workspaces/*/action",
                    "Microsoft.MachineLearningServices/workspaces/*/delete",
                    "Microsoft.MachineLearningServices/workspaces/*/write",
                    "Microsoft.Network/virtualNetworks/*/read",
                    "Microsoft.Network/virtualNetworks/subnets/join/action"
                ],
                "notActions": [],
                "dataActions": [],
                "notDataActions": []
            }
        ]
    }
}

In comparison, here is the Terraform code that works (keep in mind that an SP is deploying it, not my user, and its only role is Owner):

# Create a compute instance for each user
resource "azurerm_machine_learning_compute_instance" "aml_compute_instance" {
  name                          = "${var.user.mail_nickname}-${var.context.environment}-A8M-V2"
  machine_learning_workspace_id = var.machine_learning_workspace_id
  virtual_machine_size          = "STANDARD_A8M_V2"
  identity {
    type = "UserAssigned"
    identity_ids = [
      azurerm_user_assigned_identity.aml_user_assigned_identity.id
    ]
  }
  assign_to_user {
    object_id = var.user.object_id
    tenant_id = nonsensitive(var.secrets.TENANT_ID)
  }
  node_public_ip_enabled = false
  subnet_resource_id     = var.machine_learning_subnet_id
  description            = "Compute instance generated by Terraform for : ${var.user.mail_nickname}"

  tags = var.tags

  depends_on = [
    module.keyvault_policy_aml_user_assigned_identity,
    module.roles_aml_user_assigned_identity
  ]
}

I use the same subnet for the deployment of the workspace and the compute instances (and also clusters). I only use 20-30 IPs in my /24 subnet for now.


Solution

  • After some troubleshooting with Azure Network support, we found out that my Terraform had a null_resource that triggered a DNS refresh dedicated to my subscription.

    So when I created a compute instance manually, that refresh never ran and the URL of the compute instance was not resolved by our DNS. I had to refresh the DNS so that it picked up the new record mapping the compute instance's name to its IP.

    So it was not a permissions issue, despite what the error says. The error message is basically a default error saying "Something went wrong, I can't find the URL, you can't access it, I don't know why, but I'll say you don't have the privilege of accessing it", which is totally misleading.
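
If you hit the same error, it may be worth ruling out DNS before digging into RBAC. Below is a minimal Python sketch, not the check we actually ran with support, that tests whether a hostname resolves from the machine you are browsing from. The function names are made up for illustration, and the hostname you pass is an assumption: take the real one from the compute instance URL shown in your browser.

```python
import socket
import sys


def resolve_host(hostname, port=443):
    """Return the sorted list of IP addresses the local resolver gives for hostname.

    An empty list means the name could not be resolved at all.
    """
    try:
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Resolution failed: no record, or the DNS server is unreachable.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def diagnose(hostname):
    """Build a one-line, human-readable summary of the DNS lookup."""
    addresses = resolve_host(hostname)
    if not addresses:
        return f"{hostname}: no DNS record found, suspect DNS rather than RBAC"
    return f"{hostname}: resolves to {', '.join(addresses)}"


if __name__ == "__main__":
    # Usage: python check_dns.py <compute-instance-hostname>
    # The hostname is whatever appears in the compute instance URL (placeholder here).
    if len(sys.argv) != 2:
        sys.exit("usage: python check_dns.py <hostname>")
    print(diagnose(sys.argv[1]))
```

If the name does not resolve for a manually created instance but does for one created by Terraform, the problem is most likely a missing DNS record and not a missing role assignment.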