kubernetes, kubectl, prometheus-operator

Cannot apply PrometheusRule due to a gateway timeout


I'm trying to modify a PrometheusRule in my cluster, but I'm hitting a timeout error that I don't understand. Here is the rule I'm trying to modify:

# modif.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  annotations:
    meta.helm.sh/release-name: monitoring-platform
    meta.helm.sh/release-namespace: monitoring
  creationTimestamp: "2022-01-04T09:20:58Z"
  generation: 1
  labels:
    app: prometheus-operator
    app.kubernetes.io/managed-by: Helm
    prometheus: kube-op
    release: monitoring-platform
  name: kube-op-apps-rules
  namespace: monitoring
  resourceVersion: "948572193"
  uid: a461d478-9e61-4004-a129-9ed3f5efe8b0
spec:
  groups:
  - name: kubernetes-apps
    rules:
    - alert: KubePodCrashLooping
      annotations:
        message: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container
          }}) is restarting {{ printf "%.2f" $value }} times / 20 minutes.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepodcrashlooping
      expr: rate(kube_pod_container_status_restarts_total{job="kube-state-metrics",
        pod!~"social-reco-prod.*"}[15m]) * 60 * 20 > 0
      for: 1h
      labels:
        severity: critical

If I try to apply this file, I get the following error:

09:14:23 bastien@work:/work/$ kubectl -v 6 apply -f modif.yaml
I0103 09:14:25.389081 1969468 loader.go:379] Config loaded from file:  /home/bastien/.kube/config
I0103 09:14:25.436584 1969468 round_trippers.go:445] GET https://1.2.3.4/openapi/v2?timeout=32s 200 OK in 46 milliseconds
I0103 09:14:25.655706 1969468 round_trippers.go:445] GET https://1.2.3.4/apis/external.metrics.k8s.io/v1beta1?timeout=32s 200 OK in 14 milliseconds
I0103 09:14:25.664871 1969468 cached_discovery.go:82] skipped caching discovery info, no resources found
I0103 09:14:25.696947 1969468 round_trippers.go:445] GET https://1.2.3.4/apis/monitoring.coreos.com/v1/namespaces/monitoring/prometheusrules/kube-op-apps-rules 404 Not Found in 30 milliseconds
I0103 09:14:25.728817 1969468 round_trippers.go:445] GET https://1.2.3.4/api/v1/namespaces/monitoring 200 OK in 31 milliseconds
I0103 09:14:59.759927 1969468 round_trippers.go:445] POST https://1.2.3.4/apis/monitoring.coreos.com/v1/namespaces/monitoring/prometheusrules?fieldManager=kubectl-client-side-apply 504 Gateway Timeout in 34030 milliseconds

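(For reference, bumping kubectl's verbosity prints more of the HTTP exchange, and a server-side dry run should go through the same admission path without persisting anything. These are generic commands, nothing specific to this cluster.)

# Print request/response details (use -v 9 for untruncated output)
kubectl -v 8 apply -f modif.yaml

# Send the request to the API server without actually creating the object
kubectl apply --dry-run=server -f modif.yaml
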
The rule itself seems fine, since the same apply command works on a secondary cluster. Both clusters report the same versions:

# The malfunctioning one
12:37:34 bastien@work:/work/$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean", BuildDate:"2021-03-18T01:10:43Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.14-gke.4300", GitCommit:"348bdc1040d273677ca07c0862de867332eeb3a1", GitTreeState:"clean", BuildDate:"2022-08-17T09:22:54Z", GoVersion:"go1.16.15b7", Compiler:"gc", Platform:"linux/amd64"}

# The working one
13:25:07 bastien@work:/work/$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean", BuildDate:"2021-03-18T01:10:43Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.14-gke.4300", GitCommit:"348bdc1040d273677ca07c0862de867332eeb3a1", GitTreeState:"clean", BuildDate:"2022-08-17T09:22:54Z", GoVersion:"go1.16.15b7", Compiler:"gc", Platform:"linux/amd64"}

Do you have any clue what's wrong? Or at least, where I could find some logs or info on what's going on?


Solution

  • I finally stumbled upon someone who had an issue that looked like mine:

    https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/8303

    In that thread the OP ran several tests until someone pointed out a potential fix: allowing the GKE master to communicate with the kubelets.

    https://github.com/prometheus-operator/prometheus-operator/issues/2711#issuecomment-521103022

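    That issue suggests the 504 comes from the API server timing out while calling the prometheus-operator admission webhook on port 8443, which the GKE control plane cannot reach on private clusters unless a firewall rule allows it. A quick way to see which webhook intercepts PrometheusRule writes is to list the admission webhook configurations; the exact webhook name depends on the Helm release, so treat the second command as a placeholder:

    kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations
    kubectl get validatingwebhookconfiguration <release-name>-admission -o yaml
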
    The proposed Terraform is the following:

    resource "google_compute_firewall" "gke-master-to-kubelet" {
      name    = "k8s-master-to-kubelets"
      network = "XXXXX"
      project = "XXXXX"
    
      description = "GKE master to kubelets"
    
      source_ranges = ["${data.terraform_remote_state.network.master_ipv4_cidr_block}"]
    
      allow {
        protocol = "tcp"
        ports    = ["8443"]
      }
    
      target_tags = ["gke-main"]
    }
    

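    For anyone not using Terraform, an equivalent rule can be created with gcloud; the project, network, and target tag below are just the placeholders from the snippet above:

    gcloud compute firewall-rules create k8s-master-to-kubelets \
      --project XXXXX \
      --network XXXXX \
      --description "GKE master to kubelets" \
      --direction INGRESS \
      --source-ranges <master_ipv4_cidr_block> \
      --allow tcp:8443 \
      --target-tags gke-main
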
    Once I added this firewall rule on my side, it completely fixed my issue. I still don't know why it suddenly stopped working in the first place.