I'm trying to modify a PrometheusRule in my cluster, but I'm encountering a timeout error that I don't understand. Here is a sample of the rule I'm trying to modify.
# modif.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  annotations:
    meta.helm.sh/release-name: monitoring-platform
    meta.helm.sh/release-namespace: monitoring
  creationTimestamp: "2022-01-04T09:20:58Z"
  generation: 1
  labels:
    app: prometheus-operator
    app.kubernetes.io/managed-by: Helm
    prometheus: kube-op
    release: monitoring-platform
  name: kube-op-apps-rules
  namespace: monitoring
  resourceVersion: "948572193"
  uid: a461d478-9e61-4004-a129-9ed3f5efe8b0
spec:
  groups:
  - name: kubernetes-apps
    rules:
    - alert: KubePodCrashLooping
      annotations:
        message: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container
          }}) is restarting {{ printf "%.2f" $value }} times / 20 minutes.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepodcrashlooping
      expr: rate(kube_pod_container_status_restarts_total{job="kube-state-metrics",
        pod!~"social-reco-prod.*"}[15m]) * 60 * 20 > 0
      for: 1h
      labels:
        severity: critical
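For what it's worth, the manifest can be validated client-side first (no round trip to the API server, so no admission webhook involved):

# Client-side only: builds and validates the object locally; nothing is
# sent to the API server, so no admission webhook is triggered.
kubectl apply --dry-run=client -f modif.yaml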
If I try to apply this file, I get the following error:
09:14:23 bastien@work:/work/$ kubectl -v 6 apply -f modif.yaml
I0103 09:14:25.389081 1969468 loader.go:379] Config loaded from file: /home/bastien/.kube/config
I0103 09:14:25.436584 1969468 round_trippers.go:445] GET https://1.2.3.4/openapi/v2?timeout=32s 200 OK in 46 milliseconds
I0103 09:14:25.655706 1969468 round_trippers.go:445] GET https://1.2.3.4/apis/external.metrics.k8s.io/v1beta1?timeout=32s 200 OK in 14 milliseconds
I0103 09:14:25.664871 1969468 cached_discovery.go:82] skipped caching discovery info, no resources found
I0103 09:14:25.696947 1969468 round_trippers.go:445] GET https://1.2.3.4/apis/monitoring.coreos.com/v1/namespaces/monitoring/prometheusrules/kube-op-apps-rules 404 Not Found in 30 milliseconds
I0103 09:14:25.728817 1969468 round_trippers.go:445] GET https://1.2.3.4/api/v1/namespaces/monitoring 200 OK in 31 milliseconds
I0103 09:14:59.759927 1969468 round_trippers.go:445] POST https://1.2.3.4/apis/monitoring.coreos.com/v1/namespaces/monitoring/prometheusrules?fieldManager=kubectl-client-side-apply 504 Gateway Timeout in 34030 milliseconds
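Interestingly, only the final POST times out (after ~34 s); all the preceding GETs succeed. In case an admission webhook is intercepting the request, here is a way to look for one (the webhook name is install-specific, so treat this as a sketch):

# List admission webhooks; prometheus-operator installs can register a
# validating webhook for prometheusrules.monitoring.coreos.com.
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

# The service/port a suspect webhook points at is what the API server
# must be able to reach; a 504 on create suggests it cannot.
kubectl describe validatingwebhookconfiguration <name-from-above>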
The rule itself seems fine, since the same apply command works on a secondary cluster. Both clusters run the same versions:
# The malfunctioning one
12:37:34 bastien@work:/work/$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean", BuildDate:"2021-03-18T01:10:43Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.14-gke.4300", GitCommit:"348bdc1040d273677ca07c0862de867332eeb3a1", GitTreeState:"clean", BuildDate:"2022-08-17T09:22:54Z", GoVersion:"go1.16.15b7", Compiler:"gc", Platform:"linux/amd64"}
# The working one
13:25:07 bastien@work:/work/$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean", BuildDate:"2021-03-18T01:10:43Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.14-gke.4300", GitCommit:"348bdc1040d273677ca07c0862de867332eeb3a1", GitTreeState:"clean", BuildDate:"2022-08-17T09:22:54Z", GoVersion:"go1.16.15b7", Compiler:"gc", Platform:"linux/amd64"}
Do you have any clue what's wrong? Or at least where I could find some logs or info on what's going on?
I finally stumbled upon someone who had an issue that looked like mine:
https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/8303
In this thread, the OP ran several tests until someone pointed out a potential solution: allowing the GKE master to communicate with the kubelets.
https://github.com/prometheus-operator/prometheus-operator/issues/2711#issuecomment-521103022
The proposed Terraform is the following:
resource "google_compute_firewall" "gke-master-to-kubelet" {
name = "k8s-master-to-kubelets"
network = "XXXXX"
project = "XXXXX"
description = "GKE master to kubelets"
source_ranges = ["${data.terraform_remote_state.network.master_ipv4_cidr_block}"]
allow {
protocol = "tcp"
ports = ["8443"]
}
target_tags = ["gke-main"]
}
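If you are not using Terraform, the equivalent rule can be created with gcloud; the project, network, master CIDR, and node tag below are placeholders from my setup:

# One-off equivalent of the Terraform above; replace the placeholder
# project/network names, master CIDR block, and node target tag.
gcloud compute firewall-rules create k8s-master-to-kubelets \
  --project XXXXX \
  --network XXXXX \
  --description "GKE master to kubelets" \
  --source-ranges <master_ipv4_cidr_block> \
  --target-tags gke-main \
  --allow tcp:8443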
Once I added this firewall rule on my side, it completely fixed my issue. I still don't know why it suddenly stopped working.
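To confirm the fix, the verbose apply can be re-run; the POST to the prometheusrules endpoint that previously returned 504 Gateway Timeout should now complete in milliseconds:

# Same command as before; the final POST should now succeed (200/201)
# instead of hanging for ~34 s and returning 504.
kubectl -v 6 apply -f modif.yaml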