Tags: monitoring, google-kubernetes-engine, stackdriver

Create an Incident and Notifications in Stackdriver when a GKE Workload has Issues


I have a GKE cluster with some workloads that can have boot issues. Is it possible to create a Stackdriver notification when a workload runs into an issue?

For example: create an incident when CrashLoopBackOff is triggered, pods are unschedulable, or the Workload Status is anything other than OK for 5 minutes.


Solution

  • You can use log-based metrics to track all the CrashLoopBackOff states in your pods, using an advanced query (see https://cloud.google.com/logging/docs/view/advanced-queries) such as the following:

    resource.type="k8s_pod"
    resource.labels.location="us-central1-a"
    resource.labels.cluster_name="standard-cluster-1"
    "myproject"
    jsonPayload.message="Back-off restarting failed container"
    resource.labels.pod_name:"myproject"
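
As a minimal sketch, the same filter can be assembled programmatically so that the location, cluster name, and pod name are not hard-coded. The function name `build_crashloop_filter` and its parameters are hypothetical; the clauses themselves mirror the query above, joined with an explicit `AND`.

```python
def build_crashloop_filter(location: str, cluster_name: str, pod_name_substring: str) -> str:
    """Return a Cloud Logging filter for back-off restart events (hypothetical helper)."""
    clauses = [
        'resource.type="k8s_pod"',
        f'resource.labels.location="{location}"',
        f'resource.labels.cluster_name="{cluster_name}"',
        # Bare quoted string: matches the text anywhere in the log entry.
        f'"{pod_name_substring}"',
        'jsonPayload.message="Back-off restarting failed container"',
        # ":" is the "has" operator, i.e. a substring match on the pod name.
        f'resource.labels.pod_name:"{pod_name_substring}"',
    ]
    # Clauses on separate lines are implicitly ANDed; the explicit AND keeps
    # the filter usable as a single-line string.
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(build_crashloop_filter("us-central1-a", "standard-cluster-1", "myproject"))
```

The printed string can be pasted into the Logs Viewer to check that it matches the expected entries before a metric is built on it.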
    

    Unschedulable pods might go into CrashLoopBackOff or never be deployed at all, which is only traceable at the API server.

    Note that, to build the log-based metric, the labels in the query must be adapted to your monitoring version (legacy or non-legacy). This example uses non-legacy monitoring and metrics.

    Create the metric via log-based metrics; it will then appear in Monitoring as logging/user/xxxx

    https://cloud.google.com/logging/docs/logs-based-metrics/
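
For illustration, the metric can also be created outside the console. The sketch below builds the request body for a log-based metric (the `name`, `description`, and `filter` fields) and an equivalent `gcloud logging metrics create` command line. The metric name `crashloop_backoff` and both helper functions are assumptions made for this example, not something the answer prescribes.

```python
import json
import shlex


def log_metric_body(name: str, description: str, log_filter: str) -> dict:
    """Build the body of a log-based counter metric (hypothetical helper)."""
    return {"name": name, "description": description, "filter": log_filter}


def gcloud_create_command(name: str, description: str, log_filter: str) -> str:
    """Return a shell-quoted `gcloud logging metrics create` command."""
    args = [
        "gcloud", "logging", "metrics", "create", name,
        f"--description={description}",
        f"--log-filter={log_filter}",
    ]
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    # Shortened filter for readability; use the full query from above in practice.
    flt = ('resource.type="k8s_pod" AND '
           'jsonPayload.message="Back-off restarting failed container"')
    print(json.dumps(log_metric_body("crashloop_backoff", "Pod back-off restarts", flt), indent=2))
    print(gcloud_create_command("crashloop_backoff", "Pod back-off restarts", flt))
```

Quoting the arguments with `shlex.quote` matters here because the filter itself contains double quotes and spaces.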

    Once the metric exists, you can create an alerting policy that notifies you when the issue occurs.
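
A rough sketch of what such an alerting policy could look like as a Monitoring API `AlertPolicy` body is shown below. It assumes a user-defined metric named `crashloop_backoff` (hypothetical), that the metric is written against the `k8s_pod` resource type, and a placeholder notification channel ID. The threshold, aligner, and 300-second duration are illustrative starting points matching the "for 5 minutes" requirement in the question, and will likely need tuning, since back-off events can be sparse.

```python
import json


def crashloop_alert_policy(metric_name: str, notification_channels: list,
                           duration_seconds: int = 300) -> dict:
    """Build an AlertPolicy body for a log-based metric (illustrative sketch)."""
    metric_filter = (
        f'metric.type="logging.googleapis.com/user/{metric_name}" '
        'AND resource.type="k8s_pod"'  # assumption: non-legacy k8s_pod resource
    )
    return {
        "displayName": f"{metric_name} alert",
        "combiner": "OR",
        "conditions": [{
            "displayName": f"{metric_name} above zero",
            "conditionThreshold": {
                "filter": metric_filter,
                "comparison": "COMPARISON_GT",
                "thresholdValue": 0,
                # The condition must hold for this long before an incident opens.
                "duration": f"{duration_seconds}s",
                "aggregations": [{
                    "alignmentPeriod": "60s",
                    "perSeriesAligner": "ALIGN_RATE",
                }],
            },
        }],
        "notificationChannels": list(notification_channels),
    }


if __name__ == "__main__":
    # Placeholder channel name; substitute a real project and channel ID.
    channels = ["projects/my-project/notificationChannels/1234567890"]
    print(json.dumps(crashloop_alert_policy("crashloop_backoff", channels), indent=2))
```

The resulting JSON can be submitted through the Monitoring API's `projects.alertPolicies.create` method, or used as a reference when filling in the same fields in the console.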