docker, jenkins, kubernetes-helm, artifactory, jenkins-groovy

New Jenkins Agent Containers Will Not Run Properly


We have been setting up a new Jenkins instance, and as we onboarded various projects we created Docker images to run in K8s pods that execute the pipelines. Suddenly, new agents have started throwing the following errors:

cp: cannot create regular file '/home/jenkins/agent/workspace/Agents/Corretto-Maven@tmp/durable-5eb2af2f/script.sh.copy': Permission denied
11:23:55  sh: 1: cannot create /home/jenkins/agent/workspace/Agents/Corretto-Maven@tmp/durable-5eb2af2f/jenkins-log.txt: Permission denied
11:23:55  sh: 1: cannot create /home/jenkins/agent/workspace/Agents/Corretto-Maven@tmp/durable-5eb2af2f/jenkins-result.txt.tmp: Permission denied
11:23:55  touch: cannot touch '/home/jenkins/agent/workspace/Agents/Corretto-Maven@tmp/durable-5eb2af2f/jenkins-log.txt': Permission denied
11:23:55  mv: cannot stat '/home/jenkins/agent/workspace/Agents/Corretto-Maven@tmp/durable-5eb2af2f/jenkins-result.txt.tmp': No such file or directory
...
11:29:02  process apparently never started in /home/jenkins/agent/workspace/Agents/Corretto-Maven@tmp/durable-5eb2af2f

In the K8s dashboard we can see that the container has started, but nothing else happens until it fails. When we searched for the error, the most relevant answer led us to check the Durable Task Plugin, and we found that we already have the latest version, 577.v2a_8a_4b_7c0247. Subsequent searches turned up nothing useful for this error.

Even more puzzling, every agent Docker image created before August still runs perfectly and shows no sign of the errors.

More puzzling still: we built a new Docker image from a basic Dockerfile that we had used before August. The old image runs perfectly, but the new one shows the same errors as all the other new images.
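
Since every error above is a Permission denied on the workspace's @tmp directory, one check worth making is which numeric UID the container runs as and whether that UID can write to the workspace. The following is a minimal sketch under assumptions: the function name report_write_access is made up for illustration, and it assumes a POSIX shell with id and ls available in the image.

```shell
#!/bin/sh
# Hypothetical helper (not part of Jenkins or Kubernetes): print the numeric
# identity of the current user and whether a directory is writable by it.
report_write_access() {
    dir="$1"
    # Numeric IDs avoid confusion when the same user name maps to
    # different UIDs in different images.
    echo "uid=$(id -u) gid=$(id -g)"
    # -d lists the directory itself; -n prints numeric owner and group.
    ls -ldn "$dir"
    if [ -w "$dir" ]; then
        echo "writable: $dir"
    else
        echo "NOT writable: $dir"
        return 1
    fi
}

# Example call, using the workspace path from the log above:
#   report_write_access /home/jenkins/agent/workspace
```

Because sh steps in the failing container never start, this would probably have to be run some other way, for example pasted into a shell opened with kubectl exec in the ubuntu2004 container (the pod name is generated per build, so it is left out here). Comparing the printed uid with the numeric owner in the ls output shows whether the two match.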

Our Process

  1. Create a Dockerfile for an image.
    • Test the image locally to ensure it builds properly and behaves as expected.
  2. Use Jenkins to build the Docker image and push the result to Artifactory.
  3. Create a YAML file for the pod images.
  4. Load that YAML as a library in the Jenkinsfile.
  5. Load the container from the named image.

Dockerfile

# Base Image to customize a Jenkins Remote Agent.
FROM ubuntu:20.04

# variables
ENV USERNAME=jenkins
ENV USERDIR=/var/$USERNAME

# add a user and group
RUN useradd -u 1001 -U -c $USERNAME -d /var/jenkins -m -s /bin/bash $USERNAME
RUN mkdir /home/$USERNAME
RUN chown $USERNAME:$USERNAME /home/$USERNAME
WORKDIR /home/$USERNAME 

# connection files required (connects to various services, such as Git, Artifactory, etc.)
COPY jen_files.tar /var/jenkins/
RUN tar xvf /var/jenkins/jen_files.tar --directory /var/jenkins/
RUN rm -f /var/jenkins/jen_files.tar

# install tools 
RUN apt-get update && apt-get install -y \
    jq \
    git \
    tar \
    zip \
    curl \
    wget \
    sudo

USER $USERNAME

CMD ["/bin/bash", "-c", "bash"]

Pod YAML

apiVersion: v1
kind: Pod
metadata:
  name: ubuntu-pod-yaml
spec:
  serviceAccountName: jenkins
  imagePullSecrets:
    - name: regcred
  containers:
    - name: ubuntu2004
      image: 'ourartifacts.jfrog.io/docker-local/jenkins-remote-agents:ubuntu2004'
      imagePullPolicy: Always
      command:
        - sleep
      args:
        - 99d
      tty: true

Jenkinsfile

// load shared library via @Library or other methods
@Library('SCMLibraries@ubuntu-pod-tests')_ // Load External Libraries
def podDefs = libraryResource('./ubuntu-pod.yaml') 

pipeline {
    agent any
    stages {
        stage('Pipeline Start') {
            stages {
                stage('SCM Library') {
                    agent {
                        kubernetes {
                            defaultContainer 'jnlp'
                            yaml podDefs
                        }
                    }
                    steps {
                        sh 'ls -la'
                        sh 'whoami'
                        sh 'echo $UID'
                        container('ubuntu2004') { // this is when we start to see the errors shown
                            sh 'whoami'
                            sh 'cat /etc/os*'
                        }
                    }
                }
            }
        }
    }
}

This method worked for every agent image we defined up until August. The problem did not appear until a couple of weeks ago, when we started the process again to build new images for projects we want to host on the new Jenkins instance. The Dockerfile shown here is the same one we used before August to build an image that still works.

What we have checked

What is causing these errors? What are we missing? Are there any other details which would make it clearer?


Solution

  • ARGH! (In a good way, maybe.)

    I found the answer/workaround buried in this question. I added runAsUser to the container's securityContext in the pod definition:

    apiVersion: v1
    kind: Pod
    metadata:
      name: ubuntu-pod-yaml
    spec:
      serviceAccountName: jenkins
      imagePullSecrets:
        - name: regcred
      containers:
        - name: ubuntu2004
          image: 'ourartifacts.jfrog.io/docker-local/jenkins-remote-agents:ubuntu2004'
          imagePullPolicy: Always
          command:
            - sleep
          args:
            - 99d
          tty: true
          securityContext:
            runAsUser: 0
    

    The errors went away and the pipeline is now working.

    HOWEVER, while this is a good workaround, it does not fully address the original problem. All of the containers that work normally already run as user 0 at startup, without this intervention in the pod's container definition.

    UPDATE: After much head-scratching, we were able to zero in on the original problem.

    During the builds of various Docker images for Jenkins agents, one of the base images used had its user set to jenkins. This setting was cached and re-used over and over again. Once we added --no-cache to the Docker build parameters, the issue went away. This means that we no longer have to add the following lines to the YAML file:

    securityContext:
      runAsUser: 0
    

    Because of the large number of users, we have decided to leave those lines in for clarity and have added information to our internal docs to highlight what is being done.
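
The cache fix described in the update can be sketched as a small shell wrapper. This is an illustration under assumptions, not the exact build command from the pipeline: the function names rebuild_agent_image and image_user and the DOCKER override variable are hypothetical. Only --no-cache comes from the text above; --pull is an extra assumption added here so the base image is re-fetched as well.

```shell
#!/bin/sh
# DOCKER can be overridden (for example DOCKER=echo) to preview the
# commands without running them. This variable is a convention of this
# sketch, not something the docker CLI reads.
DOCKER="${DOCKER:-docker}"

# Hypothetical helper: rebuild an image while ignoring cached layers.
rebuild_agent_image() {
    tag="$1"
    context="${2:-.}"
    # --no-cache skips the layer cache; --pull (an assumption, not from
    # the original post) always tries to fetch a fresh base image.
    "$DOCKER" build --no-cache --pull -t "$tag" "$context"
}

# Hypothetical helper: print the USER recorded in an image's config.
# Empty output means no USER was set, so containers default to root.
image_user() {
    "$DOCKER" image inspect --format '{{.Config.User}}' "$1"
}

# Example calls, using the image name from the pod YAML:
#   rebuild_agent_image ourartifacts.jfrog.io/docker-local/jenkins-remote-agents:ubuntu2004 .
#   image_user ourartifacts.jfrog.io/docker-local/jenkins-remote-agents:ubuntu2004
```

Running image_user against an old image and a newly built one is one way to see whether the two were built with different users, which may help confirm whether a stale cached layer is involved.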