kubernetes terraform kubernetes-helm azure-aks

Terraform helm_release resource fails but the helm command works

I am using HCP and trying to deploy the Grafana meta-monitoring and alloy Helm charts with the Terraform helm_release resource. The deployment fails with a context deadline (timeout) error, but if I run the helm command directly, the chart is deployed without any error.

resource "helm_release" "meta_monitoring" {
  name              = "meta-monitoring"
  repository        = "https://grafana.github.io/helm-charts"
  chart             = "meta-monitoring"
  version           = "1.3.0"
  namespace         = "meta"
  dependency_update = true
  depends_on        = [kubernetes_secret.minio]

  values = [
    file("./helm_charts_values/meta/values.yaml")
  ]
}

resource "helm_release" "alloy" {
  name       = "alloy"
  repository = "https://grafana.github.io/helm-charts"
  chart      = "alloy"
  version    = "0.10.0"
  namespace  = "soc-loki"
  depends_on = [kubernetes_namespace.loki]

  values = [
    file("./helm_charts_values/alloy/values.yaml")
  ]
}

The resources these releases depend on are created without any problem:

resource "kubernetes_namespace" "meta" {
  metadata {
    name = "meta"
  }
}

resource "kubernetes_secret" "minio" {
  metadata {
    name      = "minio"
    namespace = kubernetes_namespace.meta.metadata[0].name
  }

  data = {
    rootUser     = "xxxxxx"
    rootPassword = "xxxxx"
  }

  type = "Opaque"
}

If I run the helm command directly, it works:

helm upgrade --install meta-monitoring grafana/meta-monitoring -n meta -f meta/values.yaml

There is no connection problem to AKS, as other charts are deployed successfully.

For reference, the values.yaml for meta-monitoring:

namespacesToMonitor:
- soc-loki

cloud:
  logs:
    enabled: false
  metrics:
    enabled: false
  traces:
    enabled: false

local:
  grafana:
    enabled: true
  logs:
    enabled: true
  metrics:
    enabled: true
  traces:
    enabled: true
  minio:
    enabled: true

Another chart, which deploys successfully:

resource "helm_release" "alert_manager" {
  name       = "alertmanager"
  repository = "https://prometheus-community.github.io/helm-charts"
  chart      = "alertmanager"
  version    = "1.13.1"
  namespace  = "soc-loki"
  depends_on = [kubernetes_namespace.loki]

  values = [
    file("./helm_charts_values/alertmanager/values.yaml")
  ]
}

Solution

Here is how I solved it.

Setting wait = false explicitly allows the Helm release to be marked as successful, and all the resources of the Helm chart still come up and run perfectly, for example the pods reach the ready state.

resource "helm_release" "alloy" {
  name       = "alloy"
  repository = "https://grafana.github.io/helm-charts"
  chart      = "alloy"
  version    = "0.10.0"
  namespace  = "soc-loki"
  depends_on = [kubernetes_namespace.loki, helm_release.meta_monitoring]
  wait       = false

  values = [
    file("./helm_charts_values/alloy/values.yaml")
  ]
}

resource "helm_release" "meta_monitoring" {
  name              = "meta-monitoring"
  repository        = "https://grafana.github.io/helm-charts"
  chart             = "meta-monitoring"
  version           = "1.3.0"
  namespace         = "meta"
  dependency_update = true
  depends_on        = [kubernetes_secret.minio]
  wait              = false

  values = [
    file("./helm_charts_values/meta/values.yaml")
  ]
}

As per the docs, on wait:

Will wait until all resources are in a ready state before marking the release as successful. Defaults to true.
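
To make that default concrete, here is a sketch of what the original alloy release was implicitly asking for. The wait and timeout values are the provider defaults as I understand them (timeout is in seconds), so treat them as assumptions and check the docs for your provider version. The resource name alloy_defaults is hypothetical and only used for illustration.

```hcl
# Hypothetical resource name; same chart as the failing alloy release.
resource "helm_release" "alloy_defaults" {
  name       = "alloy"
  repository = "https://grafana.github.io/helm-charts"
  chart      = "alloy"
  version    = "0.10.0"
  namespace  = "soc-loki"

  # Assumed provider defaults, written out explicitly:
  wait    = true # block until the chart's resources report ready
  timeout = 300  # seconds to wait before the operation is abandoned
}
```

If the resources are not ready when the timeout elapses, the apply fails, which would be consistent with the context deadline error in the question.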

Here is my guess at what is happening:

The alloy and meta-monitoring charts might include jobs that are set to run post-install. When I use the helm command manually, all the resources take around 2-3 minutes to reach the ready state.
Post-install jobs will not launch until the release has been marked successful.
If wait = true in the helm_release, Terraform does not mark the release as successful until all resources are in a ready state. This effectively creates a deadlock when resources depend on the result of those jobs.
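
One possible way to test this guess is an untested sketch like the one below: keep wait = true but allow a much longer timeout. The 2-3 minutes seen with the manual command is already below the 300 seconds that I understand to be the provider default, so slow startup alone seems unlikely. If the release still fails with a generous timeout, a deadlock becomes the more plausible explanation. The resource name alloy_wait_test and the 900 second value are arbitrary choices for illustration, not recommendations.

```hcl
# Hypothetical experiment: same alloy chart, waiting enabled, longer timeout.
resource "helm_release" "alloy_wait_test" {
  name       = "alloy"
  repository = "https://grafana.github.io/helm-charts"
  chart      = "alloy"
  version    = "0.10.0"
  namespace  = "soc-loki"
  depends_on = [kubernetes_namespace.loki]

  wait    = true # keep the readiness check that wait = false skips
  timeout = 900  # seconds; arbitrary value well above the 2-3 minutes observed

  values = [
    file("./helm_charts_values/alloy/values.yaml")
  ]
}
```

Note that with wait = false Terraform no longer verifies readiness, so the depends_on on helm_release.meta_monitoring only orders the installs and does not guarantee that meta-monitoring is actually up.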

To understand more, I would need to look in detail at how the meta-monitoring and alloy Helm charts are deployed and follow the flow, but this is my closest guess.

Ref: https://github.com/hashicorp/terraform-provider-helm/issues/683#issuecomment-830872443