kubernetes, containers, google-kubernetes-engine, kubernetes-jobs

Handling Long-Running initContainers for Kubernetes Jobs


After a recent upgrade to GKE 1.26, I began encountering an issue with a Kubernetes Job that had historically run without problems.

The job itself consists of two components: an initContainer (wait-service) that waits for a dependent service to become available, and a primary container (run-job) that runs the actual script.

It looks something like the following in a nutshell (some things omitted for brevity):

apiVersion: batch/v1
kind: Job
metadata:
  name: my-job-{{ now | date "20060102150405" }}
  labels:
    app: my-job
spec:
  backoffLimit: 0
  template:
    metadata:
      labels:
        app: my-job
      annotations:
        "cluster-autoscaler.kubernetes.io/safe-to-evict": "true"
    spec:
      restartPolicy: Never
      ...
      initContainers:
      - name: wait-service
        ...
        command: ['bash', '-c', 'while [[ "$(curl -s -o /dev/null -w ''%{http_code}'' http://someService/api/v1/status)" != "200" ]]; do echo waiting for service; sleep 2s; done']
      containers:
        - name: run-job
          ...
      volumes:
          ...
      tolerations: 
          ...

The problem I'm encountering is that roughly five minutes after a deployment, while the initContainer is still running and waiting for the service, Kubernetes creates a second instance of the job (complete with its own initContainer, etc.). This is problematic primarily because two instances of the script run by the primary container (run-job) could easily drive its operations out of sync or into a bad state (the script suspends and restores various services via the API in a specific order).

I can verify this within the logs of the original job:

wait-service  waiting for service
failed container "run-job" in pod "my-job-20230721165715-rh6s2" is waiting to start: PodInitializing for .../my-job-20230721165715-rh6s2 (run-job)
wait-service  waiting for service

So roughly five minutes after a new deployment of this job, I have two instances of it running (which aligns with the failed-container message above). This typically ends with one or both of them in a bad state.

I've attempted a few configuration changes with little success, and I'm wondering what the best way to handle this would be. Essentially, I need the initContainer's long wait to be tolerated so that it doesn't trigger the failure above and cause a new job to be created, but instead lets the original instance continue.


Solution

  • Since you're using Helm and you've given the job a timestamped name (my-job-{{ now | date "20060102150405" }}), a fresh job is created each time you run helm install, but this has no connection with any existing job(s) that may or may not still be running at the time you do the upgrade.

    If you want to ensure existing jobs are terminated when you deploy, you should consider using pre-upgrade hooks to delete any existing jobs in the application namespace before the upgrade is applied.
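
    A minimal sketch of such a hook, assuming a service account (job-cleaner here is a made-up name) with RBAC permission to delete Jobs in the namespace, and using the app: my-job label from your manifest as the selector:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: my-job-cleanup
      annotations:
        # Run this Job before the upgrade is applied; clean it up afterwards
        "helm.sh/hook": pre-upgrade
        "helm.sh/hook-delete-policy": hook-succeeded,before-hook-creation
    spec:
      backoffLimit: 0
      template:
        spec:
          restartPolicy: Never
          serviceAccountName: job-cleaner   # assumed: needs RBAC to delete Jobs
          containers:
          - name: cleanup
            image: bitnami/kubectl:latest   # any image that ships kubectl works
            command: ['kubectl', 'delete', 'jobs', '-l', 'app=my-job', '--ignore-not-found']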


    UPDATE 1

    I've spun up a 1.26 cluster, used your example (with a few tweaks to get it to run), left it for 10 minutes, and saw no additional jobs or pods.

    What you can do in the meantime, however, is trace the pods backwards to find out what "owns" them. If you kubectl describe {pod}, you'll see a line reading "Controlled By" in the output. For example:

    Controlled By:  Job/example-service-deploy-jobs-20230722170514
    

    If you see two pods, describe both and see if the same job is referenced or not. If you have both pointing at the same job, then the job has spawned two pods -- this normally means it considered the first pod as failed and has spawned the second to try again.

    If you see a different job referenced, it means another job has been deployed without deleting the first one.
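
    To check all of the job's pods in one pass, something like the following should print each pod alongside the Job that owns it (the label selector comes from your manifest; the jsonpath expression is just one way to surface ownerReferences):

    kubectl get pods -l app=my-job \
      -o jsonpath='{range .items[*]}{.metadata.name}{" -> "}{.metadata.ownerReferences[0].kind}{"/"}{.metadata.ownerReferences[0].name}{"\n"}{end}'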

    Describe the jobs and see whether they also have a "Controlled By" field (they shouldn't if they were installed by Helm or deployed manually with kubectl apply or similar) -- my reason for this check is to see whether something (like a CronJob) is triggering a job.
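
    A quick way to do that check for every job at once (a sketch; add -n <namespace> if the jobs aren't in your current namespace) is a custom-columns listing of each Job's owner, which prints <none> for jobs with no controller:

    kubectl get jobs -o custom-columns='NAME:.metadata.name,OWNER-KIND:.metadata.ownerReferences[0].kind,OWNER:.metadata.ownerReferences[0].name'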

    Separate question: how is your cluster hosted -- is it bare metal, or a managed service (AKS, EKS, GKE, etc.)?

    Another possibility, if you're running on a managed service, is that you're running on Spot/Preemptible instances, or the node is having some other issue. You can watch the nodes (watch kubectl get nodes) to see whether any of them terminate while the init container is running -- if they do, you can start investigating the reason for the node termination.
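
    On GKE specifically, you can also check whether the pods are landing on Spot/Preemptible node pools and look for node-level events (cloud.google.com/gke-spot and cloud.google.com/gke-preemptible are the standard GKE node labels, but verify them against your node pools):

    # Nodes from Spot or Preemptible node pools
    kubectl get nodes -l cloud.google.com/gke-spot=true
    kubectl get nodes -l cloud.google.com/gke-preemptible=true

    # Recent node events (preemptions, drains, NotReady transitions)
    kubectl get events -A --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp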

    In short, it is not the job itself that is the issue, but something else around it (or in the cluster).