Tags: azure, azure-bicep

How do I avoid "Couldn't find a metric" when creating a metric alert in Bicep?


The Bicep code below creates an Azure SQL Database and a metric alert that notifies me when disk space utilisation exceeds 80%.

Deploying the Bicep when the database already exists works perfectly. However, when the database and the alert both need to be created, the deployment fails with the error "Couldn't find a metric named storage_percent".

Whenever our pipeline runs against an empty Resource Group, the deployment fails - but then our automatic retry succeeds. (This is not an acceptable situation since this code is part of a larger deployment which takes around 10 minutes to complete.)

The Bicep code is as follows:

resource sqlServer 'Microsoft.Sql/servers@2021-02-01-preview' existing = {
  name: sqlServerName

  resource sqlDatabase 'databases' = {
    name: sqlDatabaseName
    properties: {
      // ...
    }
    // ...
  }
}

resource diskSpaceAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: '${sqlServer::sqlDatabase.name} Disk Space Low'
  location: 'global'
  properties: {
    description: '${sqlServer::sqlDatabase.name} database disk usage is above 80%'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          name: 'StoragePercentGt80'
          metricName: 'storage_percent'
          threshold: 80
          operator: 'GreaterThan'
          timeAggregation: 'Maximum'
          criterionType: 'StaticThresholdCriterion'
          skipMetricValidation: true  // Added in attempt to solve problem
        }
      ]
    }
    enabled: true
    evaluationFrequency: 'PT1H'
    scopes: [
      sqlServer::sqlDatabase.id
    ]
    severity: 2
    windowSize: 'PT1H'
    actions: [
      {
        actionGroupId: errorAlertingActionGroupId
      }
    ]
  }
}

The raw error shown in the Azure Portal is:

{
  "code": "DeploymentFailed",
  "target": "/subscriptions/.../resourceGroups/.../providers/Microsoft.Resources/deployments/...",
  "message": "At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.",
  "details": [
    {
      "code": "BadRequest",
      "target": "/subscriptions/.../resourceGroups/.../providers/Microsoft.Resources/deployments/...",
      "message": "Couldn't find a metric named storage_percent. Make sure the name is correct. Activity ID: ..."
    }
  ]
}

Reading the documentation, I hoped that adding skipMetricValidation: true would solve the problem, but this seems to have had no impact.

We are deploying in UK South.

Potentially of relevance, all our resources are accessible only from within Virtual Networks.


Solution

  • This is most likely an issue in the SQL Database Resource Provider. We see this problem regularly, and only with the storage_percent metric.

    The resource provider reports to the deployment engine that the database has been created successfully, so the engine is free to continue with dependent deployments, even though parts of the SQL DB are still being provisioned or configured under the hood. This opens a race condition: a dependent deployment can try to work with a resource that is still pending in some way.

    Manually adding a dependsOn, as the other answer suggests, won't help. Bicep already generates this dependency under the hood when you reference the resource's id, and the RP will still report that the dependsOn is satisfied.
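
    To illustrate the point, here is a rough sketch of the alert from the question with an explicit dependsOn added. It reuses the sqlServer and sqlDatabase symbols from the question's template and elides the remaining properties. As far as I can tell, the explicit entry changes nothing in the compiled template, because the reference in scopes already produces the same dependency.

```bicep
resource diskSpaceAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: '${sqlServer::sqlDatabase.name} Disk Space Low'
  location: 'global'
  // Redundant: the symbolic reference in scopes below already makes
  // Bicep emit this dependency in the generated ARM template.
  dependsOn: [
    sqlServer::sqlDatabase
  ]
  properties: {
    scopes: [
      sqlServer::sqlDatabase.id // symbolic reference, so an implicit dependsOn
    ]
    // ... remaining properties unchanged from the question
  }
}
```

    Either way, the dependency is considered satisfied as soon as the RP reports success, which is before the storage_percent metric becomes available.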

    One workaround is to introduce a delay manually, typically by running an ARM deployment script containing a simple wait operation in a shell script, and making your alert's Bicep depend on that deployment script. You could also make the alert depend on other resources that you know take longer to deploy, to buy yourself some time.
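
    A minimal sketch of the deployment-script delay, meant to be added to the question's template. The resource name waitForDatabase, the parameters alertDelaySeconds and location, the two-minute default and the Azure CLI version are all assumptions for illustration, not values taken from the question; tune the delay to whatever your environment needs.

```bicep
// Hypothetical parameters: how long to pause, and where to run the script.
param alertDelaySeconds int = 120
param location string = resourceGroup().location

// Runs a trivial shell script whose only job is to wait.
resource waitForDatabase 'Microsoft.Resources/deploymentScripts@2020-10-01' = {
  name: 'wait-for-${sqlDatabaseName}'
  location: location
  kind: 'AzureCLI'
  properties: {
    azCliVersion: '2.40.0' // assumed version; use one supported in your region
    scriptContent: 'sleep ${alertDelaySeconds}'
    timeout: 'PT10M'
    retentionInterval: 'PT1H' // how long the script resource is kept
    cleanupPreference: 'OnSuccess'
  }
  // Start the timer only once the RP reports the database as created.
  dependsOn: [
    sqlServer::sqlDatabase
  ]
}

resource diskSpaceAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: '${sqlServer::sqlDatabase.name} Disk Space Low'
  location: 'global'
  // Explicit dependency: the alert is created only after the wait completes.
  dependsOn: [
    waitForDatabase
  ]
  properties: {
    // ... unchanged from the question
  }
}
```

    Note that a deployment script provisions a storage account and a container instance behind the scenes, so if all your resources are restricted to Virtual Networks this may need additional configuration. A fixed delay also only narrows the race rather than removing it.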

    Bicep is considering implementing a wait-and-retry feature. The discussion includes other examples of RPs not responding properly, including a comment I've linked to showing that other configurations on a SQL server can trigger similar dependency issues, not just metric alerts.

    I'd also recommend submitting a ticket to Azure support if you have time.

    Note for other answers: setting skipMetricValidation does not help.