I have a strategy I'd like to implement that has a consumer (background worker) pod that uses KEDA to scale from 0 to 5 replicas.
The source of the scaling is a Kafka topic with a lagThreshold of 1:
{{- if .Values.keda.enabled }}
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: job-consumer-scaledobject
  namespace: {{ .Release.Namespace }}
  annotations:
    # Copy this ScaledObject's labels onto the generated HPA
    scaledobject.keda.sh/transfer-hpa-labels: "true"
spec:
  scaleTargetRef:
    name: job-consumer-app
  pollingInterval: {{ .Values.keda.pollingInterval }}
  minReplicaCount: {{ .Values.keda.minReplicas }}
  maxReplicaCount: {{ .Values.keda.maxReplicas }}
  idleReplicaCount: {{ .Values.keda.idleReplicas }}
  cooldownPeriod: {{ .Values.keda.cooldownPeriod }}
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: "{{ .Values.kafka.serviceName }}.{{ .Release.Namespace }}.svc.cluster.local:{{ .Values.kafka.servicePort }}"
        consumerGroup: "{{ .Values.keda.consumerGroup }}"
        topic: "{{ .Values.keda.topic }}"
        lagThreshold: "{{ .Values.keda.lagThreshold }}"
        offsetResetPolicy: latest
        # When "false", replicas are capped at the topic's partition count
        allowIdleConsumers: "false"
        # When "false", keep one consumer for partitions without a valid committed offset
        scaleToZeroOnInvalidOffset: "false"
        # Add debug logging
        logLevel: "debug"
{{- end }}
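For context, the relevant chart values look roughly like the sketch below. The consumer-group and Kafka service names and the numeric values are placeholders for illustration; only the topic (jobs-topic) and namespace (default) can be read off the logs further down.

# values.yaml (illustrative sketch, not the exact deployed values)
keda:
  enabled: true
  pollingInterval: 5      # seconds between Kafka lag checks
  cooldownPeriod: 300     # seconds before scaling back down after activity stops
  minReplicas: 0          # allow scale-to-zero
  maxReplicas: 5
  idleReplicas: 0
  consumerGroup: job-consumer-group   # placeholder name
  topic: jobs-topic
  lagThreshold: 1
kafka:
  serviceName: kafka      # placeholder service name
  servicePort: 9092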
The problem I am having is that it will scale from 0 to 1 just fine, but it will not scale from 1 to 2 no matter what the current lag is.
Here are the keda-operator logs showing that it is querying the Kafka topic correctly:
2025-07-11T22:28:59Z DEBUG kafka_scaler Kafka scaler: Providing metrics based on totalLag 500, topicPartitions 1, threshold 1 {"type": "ScaledObject", "namespace": "default", "name": "job-consumer-scaledobject"}
But when it reports the metric to the HPA, it always sends 1:
2025-07-11T22:28:47Z DEBUG grpc_server Providing metrics {"scaledObjectName": "job-consumer-scaledobject", "scaledObjectNamespace": "default", "metrics": "&ExternalMetricValueList{ListMeta:{ <nil>},Items:[]ExternalMetricValue{ExternalMetricValue{MetricName:s0-kafka-jobs-topic,MetricLabels:map[string]string{},Timestamp:2025-07-11 22:28:47.980947591 +0000 UTC m=+236.787074315,WindowSeconds:nil,Value:{**{1000 -3}** {<nil>} DecimalSI},},},}"}
kubectl describe hpa keda-hpa-job-consumer-scaledobject
Reference: Deployment/job-consumer-app
**Metrics: ( current / target )
"s0-kafka-jobs-topic" (target average value): 1 / 1**
Min replicas: 1
Max replicas: 5
Deployment pods: 1 current / 1 desired
Conditions:
Type Status Reason Message
---- ------ ------ -------
AbleToScale True ReadyForNewScale recommended size matches current size
ScalingActive True ValidMetricFound the HPA was able to successfully calculate a replica count from external metric s0-kafka-jobs-topic(&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: job-consumer-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},})
ScalingLimited False DesiredWithinRange the desired count is within the acceptable range
Events: <none>
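One way to double-check what KEDA is handing to the HPA is to query the external metrics API directly (metric name and namespace taken from the HPA output above); it should return the same value the grpc_server log reports:

kubectl get --raw \
  "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/s0-kafka-jobs-topic?labelSelector=scaledobject.keda.sh%2Fname%3Djob-consumer-scaledobject"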
I'm pulling my hair out over here trying to understand why the reported value stays at 1 when the lag is clearly over 500. Any thoughts? Thanks!
topicPartitions 1
It is impossible to scale this consumer group out: Kafka allows at most one consumer process per partition within a group, and your topic has a single partition. On top of that, because allowIdleConsumers is "false", KEDA caps the lag it reports at partitions × lagThreshold (here 1 × 1 = 1), which is exactly the 1 / 1 the HPA shows even though the real lag is 500.
Even if your Deployment did scale, you'd have 4 idle pods after a consumer group rebalance, and the lag would therefore remain the same.
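If the goal is to have up to five consumers actually doing work, the topic needs at least five partitions. Something along these lines would check and raise the partition count with the stock Kafka admin tool; the broker pod name (kafka-0) and script path are assumptions about your Kafka installation, so adjust to match:

# Inspect the current partition count (kafka-0 and the script path are assumptions)
kubectl exec -n default kafka-0 -- kafka-topics.sh \
  --bootstrap-server localhost:9092 --describe --topic jobs-topic

# Raise the partition count so up to 5 group members can each own a partition
kubectl exec -n default kafka-0 -- kafka-topics.sh \
  --bootstrap-server localhost:9092 --alter --topic jobs-topic --partitions 5

Keep in mind that only messages produced after the change are spread across the new partitions, and key-to-partition mapping (and therefore per-key ordering) changes when partitions are added.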