Tags: kubernetes, websocket, azure-aks, azure-application-gateway, horizontal-scaling

How to scale WebSocket connections with Azure Application Gateway and AKS


We want to dynamically scale our AKS cluster based on the number of WebSocket connections.

We use Application Gateway v2 together with the Application Gateway Ingress Controller (AGIC) on AKS as our ingress.

I configured a HorizontalPodAutoscaler to scale the deployment based on memory consumption.

When I deploy the sample app to AKS, I can connect to the WebSocket endpoints and communicate. However, whenever a scale operation happens (pods added or removed), I see connection losses on all clients.

I tried enabling cookie-based affinity on the Application Gateway, but this had no effect on the issue.

Below is the deployment I use for testing. It is based on this sample, modified a bit so that it allows specifying the number of connections and regularly sends ping messages to the server.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: wssample
spec:
  replicas: 1
  selector:
    matchLabels:
      app: wssample
  template:
    metadata:
      labels:
        app: wssample
    spec:
      containers:
      - name: wssamplecontainer
        image: marxx/websocketssample:10
        resources:
          requests:
            memory: "100Mi"
            cpu: "50m"
          limits:
            memory: "150Mi"
            cpu: "100m"
        ports:
        - containerPort: 80
          name: wssample
---
apiVersion: v1
kind: Service
metadata:
  name: wssample-service
spec:
  ports:
  - port: 80
  selector:
    app: wssample
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: websocket-ingress
  annotations:
    kubernetes.io/ingress.class: azure/application-gateway
    appgw.ingress.kubernetes.io/cookie-based-affinity: "true"
    appgw.ingress.kubernetes.io/connection-draining: "true"
    appgw.ingress.kubernetes.io/connection-draining-timeout: "60"
spec:
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: wssample-service
            port:
              number: 80
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: websocket-scaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: wssample
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource: 
      name: memory
      target:
        type: Utilization
        averageUtilization: 50

Update:

Update 2:

I found a pattern. The problem also occurs without the HPA and can be reproduced with the following steps:

  1. Scale the deployment to 3 replicas.
  2. Connect 20 clients.
  3. Manually scale the deployment to 6 replicas with the kubectl scale command.
  4. (Existing connections are still fine and clients communicate with the backend.)
  5. Connect another 20 clients.
  6. After a few seconds, all existing connections are reset.
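For reference, the manual repro above translates roughly into these commands. The deployment name comes from the manifest; the client invocation is a placeholder for whatever test client you use:

```shell
# Step 1: scale the deployment to 3 replicas and wait until they are ready
kubectl scale deployment wssample --replicas=3
kubectl rollout status deployment wssample

# Step 2: connect the first 20 clients (placeholder for your test client)
# ./wsclient --connections 20 --url ws://<app-gateway-ip>/

# Step 3: scale to 6 replicas - existing connections initially survive
kubectl scale deployment wssample --replicas=6
kubectl rollout status deployment wssample

# Step 5: connect another 20 clients; a few seconds later, all existing
# connections are reset by the Application Gateway
# ./wsclient --connections 20 --url ws://<app-gateway-ip>/
```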

Update 3:


Solution

  • I made a very unpleasant discovery. The outcome of this GitHub issue basically says that the behavior is by design: Application Gateway resets all WebSocket connections whenever the backend pool configuration changes (which happens during scale operations).

    It's possible to vote for a feature request to keep those connections alive in these situations.
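Until such a feature exists, the practical workaround is client-side reconnection. Below is a minimal, library-agnostic sketch of an exponential-backoff reconnect policy; the connect callable, jitter parameters, and attempt limit are my own assumptions, not part of the original post:

```python
import random
import time

def backoff_delays(base=1.0, cap=30.0, jitter=0.1):
    """Yield exponentially growing reconnect delays, capped at `cap` seconds,
    with a small random jitter so many clients don't reconnect in lockstep."""
    delay = base
    while True:
        yield delay + random.uniform(0, jitter * delay)
        delay = min(delay * 2, cap)

def run_with_reconnect(connect, max_attempts=10):
    """Call connect() (which blocks while the WebSocket is open) and
    re-establish the connection with backoff whenever it drops."""
    delays = backoff_delays()
    for _ in range(max_attempts):
        try:
            connect()  # returns or raises when the gateway resets the connection
        except ConnectionError:
            pass
        time.sleep(next(delays))
```

The jitter matters here: when the gateway resets all connections at once, clients without jitter would all reconnect simultaneously and hammer the newly scaled backend.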