We want to dynamically scale our AKS cluster based on the number of WebSocket connections.
We use Application Gateway V2 along with the Application Gateway Ingress Controller on AKS as the ingress.
I configured a HorizontalPodAutoscaler (HPA) to scale the deployment based on memory consumption.
When I deploy the sample app to AKS, I can connect to the WebSocket endpoints and communicate. However, whenever a scale operation happens (pods are added or removed), all clients lose their connections.
I tried enabling cookie-based affinity on the Application Gateway, but this had no effect on the issue.
Below is the deployment I use for testing (together with the Service, Ingress and HPA). It is based on this sample and modified a bit so that it allows specifying the number of connections and regularly sends ping messages to the server; a sketch of that client behavior follows the manifests.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wssample
spec:
  replicas: 1
  selector:
    matchLabels:
      app: wssample
  template:
    metadata:
      labels:
        app: wssample
    spec:
      containers:
        - name: wssamplecontainer
          image: marxx/websocketssample:10
          resources:
            requests:
              memory: "100Mi"
              cpu: "50m"
            limits:
              memory: "150Mi"
              cpu: "100m"
          ports:
            - containerPort: 80
              name: wssample
---
apiVersion: v1
kind: Service
metadata:
  name: wssample-service
spec:
  ports:
    - port: 80
  selector:
    app: wssample
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: websocket-ingress
  annotations:
    kubernetes.io/ingress.class: azure/application-gateway
    appgw.ingress.kubernetes.io/cookie-based-affinity: "true"
    appgw.ingress.kubernetes.io/connection-draining: "true"
    appgw.ingress.kubernetes.io/connection-draining-timeout: "60"
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: wssample-service
                port:
                  number: 80
---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: websocket-scaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: wssample
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 50
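For reference, this is roughly what the test client inside that image does (a minimal sketch in Python rather than the actual code from the sample; the endpoint path /ws, the number of connections and the message format are assumptions):

# Rough sketch of the test client's behavior: open N WebSocket connections
# against the Application Gateway and send a ping on each one at a fixed
# interval, logging whenever a connection is dropped. The endpoint, path
# and message format here are hypothetical, not taken from the sample image.
import asyncio
import websockets

ENDPOINT = "ws://<app-gateway-public-ip>/ws"   # hypothetical endpoint
CONNECTIONS = 100                              # number of parallel sockets
PING_INTERVAL_SECONDS = 10

async def client(client_id: int) -> None:
    try:
        async with websockets.connect(ENDPOINT) as ws:
            while True:
                await ws.send(f"ping from client {client_id}")
                await ws.recv()                # wait for the server's reply
                await asyncio.sleep(PING_INTERVAL_SECONDS)
    except websockets.ConnectionClosed as exc:
        # This fires on every client whenever the connection is reset
        # during a scale operation.
        print(f"client {client_id} lost its connection: {exc}")

async def main() -> None:
    await asyncio.gather(*(client(i) for i in range(CONNECTIONS)))

asyncio.run(main())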
Update:
Update 2:
I found a pattern. The problem also occurs without HPA and can be reproduced with the following steps:
Update 3:
I made a very unpleasant discovery. The outcome of this GitHub issue is essentially that the behavior is by design: Application Gateway resets all WebSocket connections whenever the backend pool rules change (which happens during scale operations).
It is possible to vote for a feature request to keep those connections alive in those situations.
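Until something like that ships, the only mitigation I see is to make the clients reconnect on their own. A rough sketch of such a reconnect loop with exponential backoff (same hypothetical endpoint as in the client sketch above, not code from the sample):

# Possible client-side mitigation: whenever AGW drops the socket during a
# backend pool update, reconnect with exponential backoff instead of giving up.
import asyncio
import websockets

ENDPOINT = "ws://<app-gateway-public-ip>/ws"   # hypothetical endpoint

async def resilient_client(client_id: int) -> None:
    backoff = 1
    while True:
        try:
            async with websockets.connect(ENDPOINT) as ws:
                backoff = 1                    # reset backoff after a successful connect
                while True:
                    await ws.send(f"ping from client {client_id}")
                    await ws.recv()
                    await asyncio.sleep(10)
        except (websockets.ConnectionClosed, OSError):
            # The connection was reset (e.g. by a backend pool change); retry.
            await asyncio.sleep(backoff)
            backoff = min(backoff * 2, 30)

asyncio.run(resilient_client(0))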