kubernetes, hpa, keda

Using multiple autoscaling mechanisms to autoscale a K8s cluster


In a recent experiment, I tried to autoscale my K8s cluster using two mechanisms: KEDA and HPA (see below). I wanted to use the HPA's out-of-the-box resource metrics to scale based on pod resource utilization (memory and CPU), and KEDA to autoscale based on custom metrics.

The deployment succeeded and the cluster was healthy and functional, but when autoscaling kicked in, the cluster went haywire! Pods were constantly being provisioned and then de-provisioned, and this continued even after I stopped sending traffic to the cluster. I had to wait for the cool-down periods before it settled down again.

I couldn't find any official documentation on this topic, so I'm asking here.

My question: can KEDA and HPA safely be combined like this, or are the two autoscaling mechanisms fighting each other?

This was on K8s version 1.15.11 and KEDA 1.4.1.

apiVersion: keda.k8s.io/v1alpha1
kind: ScaledObject
metadata:
  name: {{ $fullName }}
  labels:
    deploymentName: {{ $fullName }}
    {{- include "deployment.labels" . | nindent 4 }}
spec:
  scaleTargetRef:
    deploymentName: {{ $fullName }}
  pollingInterval: {{ .Values.scaleobject.pollingInterval }}
  cooldownPeriod:  {{ .Values.scaleobject.cooldownPeriod }}
  minReplicaCount: {{ .Values.scaleobject.minReplicaCount }}
  maxReplicaCount: {{ .Values.scaleobject.maxReplicaCount }}   
  triggers:
  - type: prometheus
    metadata:
      serverAddress: {{ tpl .Values.scaleobject.serverAddress . | quote }}  
      metricName: access_frequency
      threshold: "{{ .Values.scaleobject.threshold }}"
      query: {{ tpl .Values.scaleobject.query . | quote  }}
---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: resource-utilization-scaling
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ $fullName }}
  minReplicas: {{ .Values.scaleobject.minReplicaCount }}
  maxReplicas: {{ .Values.scaleobject.maxReplicaCount }}
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: {{ .Values.scaleobject.cpuUtilization }}
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: {{ .Values.scaleobject.memUtilization }}
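
For reference, the templates above read their settings from a scaleobject block in the chart's values.yaml. A minimal sketch of what that block might look like, with every value purely illustrative (the Prometheus address and query in particular are placeholders for whatever your environment actually exposes):

scaleobject:
  pollingInterval: 30      # seconds between KEDA checks of the Prometheus trigger
  cooldownPeriod: 300      # seconds of quiet before KEDA scales back down
  minReplicaCount: 2
  maxReplicaCount: 20
  threshold: 100           # access_frequency value per replica that triggers scaling
  serverAddress: http://prometheus-server.monitoring.svc.cluster.local:9090   # placeholder address
  query: 'sum(rate(http_requests_total{app="my-app"}[2m]))'                   # placeholder query
  cpuUtilization: 70       # HPA target average CPU utilization (%)
  memUtilization: 80       # HPA target average memory utilization (%)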


Solution

  • KEDA doesn't have direct cluster autoscaler support yet, so you will see some unpredictability. In essence, you have two sources of scaling decisions that don't share information (KEDA's and the cluster autoscaler's), and they may disagree at any given moment.

    In my opinion, it's best to slow down all of your autoscaling so that every autoscaler has time to catch up with any discrepancy. For example, you can make use of cooldowns in an autoscaling group to avoid resource starvation; a sketch of more conservative settings follows below.

    ✌️
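
To make that advice concrete for the chart above, here is one possible sketch (the numbers are illustrative, not recommendations): lengthen KEDA's pollingInterval and cooldownPeriod in values.yaml so scale-downs happen less eagerly, and, if the cluster is on Kubernetes 1.18 or newer, add a scale-down stabilization window to the HPA. Note that the behavior field was only added to autoscaling/v2beta2 in 1.18, so it is not available on the 1.15.11 cluster in the question.

scaleobject:
  pollingInterval: 60      # check the Prometheus trigger once a minute
  cooldownPeriod: 600      # wait 10 minutes of quiet before KEDA scales back down
  minReplicaCount: 2
  maxReplicaCount: 20

On 1.18+ the HPA template could additionally declare how aggressively pods may be removed (only the CPU metric is repeated here; memory would be as in the question):

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: resource-utilization-scaling
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ $fullName }}
  minReplicas: {{ .Values.scaleobject.minReplicaCount }}
  maxReplicas: {{ .Values.scaleobject.maxReplicaCount }}
  behavior:                               # requires Kubernetes 1.18+
    scaleDown:
      stabilizationWindowSeconds: 600     # look at the last 10 minutes before scaling down
      policies:
      - type: Pods
        value: 1
        periodSeconds: 120                # remove at most one pod every two minutes
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: {{ .Values.scaleobject.cpuUtilization }}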