kubernetes apache-kafka keda

I cannot get my pod to scale from 1 to 2 instances


I have a strategy I'd like to implement that has a consumer (background worker) pod that uses KEDA to scale from 0 to 5 replicas.

The source of the scaling is a Kafka topic with a lagThreshold of 1:

{{- if .Values.keda.enabled }}
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: job-consumer-scaledobject
  namespace: {{ .Release.Namespace }}
  annotations:
    # Copy the ScaledObject's labels onto the HPA that KEDA generates
    scaledobject.keda.sh/transfer-hpa-labels: "true"
spec:
  scaleTargetRef:
    name: job-consumer-app
  pollingInterval: {{ .Values.keda.pollingInterval }}
  minReplicaCount: {{ .Values.keda.minReplicas }}
  maxReplicaCount: {{ .Values.keda.maxReplicas }}
  idleReplicaCount: {{ .Values.keda.idleReplicas }}
  cooldownPeriod: {{ .Values.keda.cooldownPeriod }}
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: "{{ .Values.kafka.serviceName }}.{{ .Release.Namespace }}.svc.cluster.local:{{ .Values.kafka.servicePort }}"
      consumerGroup: "{{ .Values.keda.consumerGroup }}"
      topic: "{{ .Values.keda.topic }}"
      lagThreshold: "{{ .Values.keda.lagThreshold }}"
      offsetResetPolicy: latest
      # Do not allow more consumers than topic partitions
      allowIdleConsumers: "false"
      # Keep a consumer for partitions with an invalid offset instead of ignoring them
      scaleToZeroOnInvalidOffset: "false"
      # Add debug logging
      logLevel: "debug"
{{- end }}
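
For reference, the values the template pulls in look roughly like this. This is only a sketch of the values.yaml: the consumer group, service name, and interval/cooldown numbers are illustrative placeholders, while the topic, lagThreshold, and maxReplicas match what is shown in the logs and HPA output below.

keda:
  enabled: true
  pollingInterval: 5        # illustrative
  minReplicas: 1            # idleReplicas must be lower than minReplicas
  maxReplicas: 5
  idleReplicas: 0           # lets KEDA scale to 0 when there is no lag
  cooldownPeriod: 300       # illustrative
  consumerGroup: job-consumer-group   # placeholder name
  topic: jobs-topic
  lagThreshold: "1"

kafka:
  serviceName: kafka        # placeholder name
  servicePort: 9092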

The problem I am having is that it will scale from 0 to 1 just fine, but it will not scale from 1 to 2 no matter what the current lag is.

Here are the keda-operator logs showing that it is querying the Kafka topic correctly:

2025-07-11T22:28:59Z    DEBUG   kafka_scaler    Kafka scaler: Providing metrics based on totalLag 500, topicPartitions 1, threshold 1   {"type": "ScaledObject", "namespace": "default", "name": "job-consumer-scaledobject"}
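
To rule out KEDA misreading the lag, the consumer group can also be checked directly against the broker. A sketch, assuming a broker pod named kafka-0 with the standard Apache Kafka scripts on it, and the placeholder group name from above:

kubectl exec -it kafka-0 -n default -- \
  kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group job-consumer-group

This prints per-partition CURRENT-OFFSET, LOG-END-OFFSET and LAG, and also confirms how many partitions the topic actually has.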

But when communicating the metric to the HPA it always sends 1:

2025-07-11T22:28:47Z    DEBUG   grpc_server     Providing metrics       {"scaledObjectName": "job-consumer-scaledobject", "scaledObjectNamespace": "default", "metrics": "&ExternalMetricValueList{ListMeta:{   <nil>},Items:[]ExternalMetricValue{ExternalMetricValue{MetricName:s0-kafka-jobs-topic,MetricLabels:map[string]string{},Timestamp:2025-07-11 22:28:47.980947591 +0000 UTC m=+236.787074315,WindowSeconds:nil,Value:{**{1000 -3}** {<nil>}  DecimalSI},},},}"}
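
(The Value printed as {1000 -3} is the internal form of a Kubernetes resource.Quantity: an unscaled integer with a base-10 exponent, i.e. 1000 × 10⁻³ = 1. So the metrics adapter really is handing the HPA a value of 1, not the raw lag of 500.)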

The HPA shows the same thing:

kubectl describe hpa keda-hpa-job-consumer-scaledobject

Reference:                                       Deployment/job-consumer-app
**Metrics:                                         ( current / target )
  "s0-kafka-jobs-topic" (target average value):  1 / 1**
Min replicas:                                    1
Max replicas:                                    5
Deployment pods:                                 1 current / 1 desired
Conditions:
  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    ReadyForNewScale    recommended size matches current size
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from external metric s0-kafka-jobs-topic(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: job-consumer-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},})
  ScalingLimited  False   DesiredWithinRange  the desired count is within the acceptable range
Events:           <none>
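
For completeness, the exact value the HPA receives can also be read straight from the external metrics API. A sketch, assuming the default namespace and the metric/selector names shown in the logs above:

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/s0-kafka-jobs-topic?labelSelector=scaledobject.keda.sh%2Fname%3Djob-consumer-scaledobject"

This should return the same value of 1 that the adapter logs.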

I'm pulling my hair out over here trying to understand why my target stays at 1 when the lag is clearly over 500. Any thoughts? Thanks!


Solution

  • topicPartitions 1
    

    This is the key line in the scaler log: the topic has a single partition. Within a consumer group, Kafka assigns at most one consumer per partition, so you're limited to one consumer process in the group per partition. And because allowIdleConsumers is "false" (the default), KEDA caps the metric it reports so the replica count never exceeds the partition count, which is why the HPA sees 1 / 1 no matter how large the lag is.

    Even if your Deployment did scale, you'd have four idle pods after the consumer group rebalance, and the lag would therefore remain the same. The fix is to give the topic enough partitions for the consumers you want, as sketched after this list.
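
A hedged sketch of the fix, assuming a broker pod named kafka-0 with the standard Kafka scripts and the topic name jobs-topic from the logs: raise the partition count to match maxReplicaCount so each consumer can own at least one partition, then KEDA/HPA can scale past 1. Note that partitions can only be increased, adding them only affects newly produced records, and per-key ordering changes going forward.

# Raise the partition count to match maxReplicaCount
kubectl exec -it kafka-0 -- kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --alter --topic jobs-topic --partitions 5

# Verify the new partition count
kubectl exec -it kafka-0 -- kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --describe --topic jobs-topic

Setting allowIdleConsumers: "true" would remove KEDA's cap on the reported metric, but as noted above the extra consumers would just sit idle on a one-partition topic, so more partitions is the real fix.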