amazon-web-services docker kubernetes amqp kubernetes-deployment

CrashLoopBackOff runs intermittently on pods


I have a Kubernetes cluster running on AWS using EKS (Elastic Kubernetes Service) and ECR (Elastic Container Registry). One specific deployment of mine runs fine for the first two or three restarts, then always enters CrashLoopBackOff on image pull, waits for the length of the back-off, runs fine again, and then repeats the process.

These pods consist of a Docker container which waits for a message from a message queue and runs a process, after which the container stops. At that point the deployment restarts the container, always pulling the image from ECR.

As these pods are intended to handle a lot of traffic and have a short runtime (~1-30 seconds), having each pod immediately enter CrashLoopBackOff on pull and then wait for five minutes before actually running wastes a lot of time.

I've had a look around for any answers to this, but all the questions I've seen describe cases where CrashLoopBackOff continues to run indefinitely, rather than a pod entering CrashLoopBackOff then running successfully once the wait time has finished.

I've checked the logs for the pods which have this issue and there is nothing there which indicates any errors. I'm wondering if there is a way to "pause" the container after it is pulled, to ensure it is up and running correctly before the docker command is actually run? Or any other way to delay CrashLoopBackOff for a configurable amount of seconds? I've added "sleep 15;" to the start of my docker container command, but that hasn't helped the issue.

Deployment Yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: piml-xgboost
spec:
  replicas: 5
  selector:
    matchLabels:
      app: piml-xgboost
  template:
    metadata:
      labels: 
        app: piml-xgboost
    spec:
      serviceAccountName: cluster-service-account
      containers:
      - name: piml-unet
        image: 'ecr_path'
        imagePullPolicy: "Always"
        resources:
          requests:
            memory: "500Mi"
          limits:
            memory: "4Gi"
        env:
        - name: BROKER_URL
          value: 'amqp_broker_url'
        - name: QUEUE
          value: 'amqp_queue'
        - name: method
          value: xgboost
        - name: k8s
          value: 'True'

Typical 'kubectl get pods' output:

NAME                                    READY   STATUS             RESTARTS          AGE
piml-xgboost-77d48f9db8-5txmz           0/1     CrashLoopBackOff   959 (2m51s ago)   3d21h
piml-xgboost-77d48f9db8-gs542           0/1     CrashLoopBackOff   532 (108s ago)    2d1h
piml-xgboost-77d48f9db8-pmvlg           0/1     CrashLoopBackOff   979 (44s ago)     3d23h
piml-xgboost-77d48f9db8-wckmk           0/1     CrashLoopBackOff   533 (59s ago)     2d1h
piml-xgboost-77d48f9db8-wz657           0/1     CrashLoopBackOff   712 (2m39s ago)   2d21h

Docker command from Dockerfile

CMD sleep 5;/usr/bin/amqp-consume --url=$BROKER_URL -q $QUEUE -c 1 ./docker_script.py

Solution

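  • A Deployment is not suitable for your use case. A Deployment is designed for services that run permanently, e.g. serving a REST service or a worker that registers with a message queue and keeps consuming (which is close to your use case). When a container stops, as you noted, Kubernetes will restart it, but if that happens repeatedly the container is considered to be in an erroneous state: the kubelet delays each further restart with an exponential back-off (10s, 20s, 40s, ...) capped at five minutes, and that delay is what shows up as CrashLoopBackOff. This is why the first few restarts are quick and later ones wait, even though nothing is wrong with the image pull.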

    You may have two options:

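    1. Redesign your app so that it does not stop after finishing its work but listens on the queue again for new messages. With your current command, the -c 1 flag tells amqp-consume to exit after one message, so dropping it would keep the consumer running.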

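    2. Switch from a Deployment to a Job or CronJob, whose pods are expected to run to completion and are not restarted after a successful exit (and remove the sleep from the container's command). Note that a CronJob schedule uses cron syntax with a minimum interval of one minute, so it cannot run every 5 seconds.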
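
    As a rough illustration of option 2, here is a minimal sketch of what the same workload might look like as a CronJob. It reuses the names and placeholders from the Deployment in the question; the schedule, concurrencyPolicy, parallelism and activeDeadlineSeconds values are assumptions chosen for illustration, not tested settings. Cron schedules have one-minute granularity, so each run starts up to five pods that each consume one message and then exit.

    ```yaml
    apiVersion: batch/v1              # assumes a cluster version where CronJob is available in batch/v1
    kind: CronJob
    metadata:
      name: piml-xgboost
    spec:
      schedule: "* * * * *"           # every minute; cron syntax cannot express seconds
      concurrencyPolicy: Forbid       # assumption: skip a run while the previous one is still active
      jobTemplate:
        spec:
          parallelism: 5              # mirrors replicas: 5 from the Deployment
          activeDeadlineSeconds: 55   # assumption: stop pods still waiting on an empty queue
          template:
            metadata:
              labels:
                app: piml-xgboost
            spec:
              serviceAccountName: cluster-service-account
              restartPolicy: Never    # a container that exits is not restarted in place
              containers:
              - name: piml-xgboost
                image: 'ecr_path'     # placeholder from the question
                resources:
                  requests:
                    memory: "500Mi"
                  limits:
                    memory: "4Gi"
                env:
                - name: BROKER_URL
                  value: 'amqp_broker_url'
                - name: QUEUE
                  value: 'amqp_queue'
                - name: method
                  value: xgboost
                - name: k8s
                  value: 'True'
    ```

    Because of the one-minute floor, this only suits workloads that can tolerate that latency. For the high-traffic, short-runtime case described in the question, option 1 (a long-running consumer in the existing Deployment) is probably the better fit.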