docker kubernetes rancher condor

[HTCondor][Kubernetes / k8s]: Unable to start the minicondor image within k8s - condor_master not working


POST EDIT

The issue is caused by the PSP (Pod Security Policy): by default, privilege escalation is not permitted for my condor user. That is why it does not work: supervisord runs as the root user and tries to write logs and start the condor collector as root instead of as another user (i.e. condor).

Description

The mini-condor base image does not start as expected in a Kubernetes (Rancher) pod.

I am using :

PS: the image was working perfectly on:

  • a local env
  • minikube default installation

I am running it as a simple deployment :

When the pod starts, the default Kubernetes log displays:

2021-09-15 09:26:36,908 INFO supervisord started with pid 1
2021-09-15 09:26:37,911 INFO spawned: 'condor_master' with pid 20
2021-09-15 09:26:37,912 INFO spawned: 'condor_restd' with pid 21
2021-09-15 09:26:37,917 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:37,924 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:38,926 INFO spawned: 'condor_master' with pid 22
2021-09-15 09:26:38,928 INFO spawned: 'condor_restd' with pid 23
2021-09-15 09:26:38,932 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:38,936 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:40,939 INFO spawned: 'condor_master' with pid 24
2021-09-15 09:26:40,943 INFO spawned: 'condor_restd' with pid 25
2021-09-15 09:26:40,947 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:40,948 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:43,953 INFO spawned: 'condor_master' with pid 26
2021-09-15 09:26:43,955 INFO spawned: 'condor_restd' with pid 27
2021-09-15 09:26:43,959 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:43,968 INFO gave up: condor_restd entered FATAL state, too many start retries too quickly
2021-09-15 09:26:43,969 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:44,970 INFO gave up: condor_master entered FATAL state, too many start retries too quickly
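
The supervisord output above can be summarized per program to make the failure pattern easier to read. The following is a minimal sketch in Python and is not part of the original setup: the function name `summarize_exits` and the regular expression are assumptions based only on the log lines shown here. By shell convention an exit status of 127 usually means "command not found", which would suggest that `condor_restd` fails for a different reason than `condor_master`.

```python
import re
from collections import defaultdict

# Matches lines such as:
# 2021-09-15 09:26:37,917 INFO exited: condor_restd (exit status 127; not expected)
EXIT_RE = re.compile(r"INFO exited: (\S+) \(exit status (\d+); not expected\)")


def summarize_exits(log_text):
    """Return {program: {exit_status: count}} for unexpected exits."""
    summary = defaultdict(lambda: defaultdict(int))
    for line in log_text.splitlines():
        match = EXIT_RE.search(line)
        if match:
            program, status = match.group(1), int(match.group(2))
            summary[program][status] += 1
    # Convert to plain dicts so the result prints and compares cleanly.
    return {prog: dict(codes) for prog, codes in summary.items()}


# A few lines copied from the log above, used as sample input.
SAMPLE = """\
2021-09-15 09:26:37,917 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:37,924 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:38,932 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:38,936 INFO exited: condor_master (exit status 4; not expected)
"""

if __name__ == "__main__":
    print(summarize_exits(SAMPLE))
```

Each program fails with a single, constant exit status on every retry, which points to a deterministic startup problem rather than a transient one.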

Here is a brief summary of the commands and their output:

CMD | output
condor_status | CEDAR:6001:Failed to connect to <127.0.0.1:9618>
condor_master | ERROR "Cannot open log file '/var/log/condor/MasterLog'" at line 174 in file /var/lib/condor/execute/slot1/dir_17406/userdir/.tmpruBd6F/BUILD/condor-9.0.5/src/condor_utils/dprintf_setup.cpp

1) First attempt to fix the issue

I decided to customize the image, but the error is the same.

The Docker image used to try to fix the permission issue:

FROM htcondor/mini:9.2-el7

RUN condor_master

RUN chown condor:root /var/
RUN chown condor:root /var/log
RUN chown -R condor:root /var/log/
RUN chown -R condor:condor /var/log/condor

RUN chown condor:condor /var/log/condor/ProcLog
RUN chown condor:condor /var/log/condor/MasterLog

RUN chmod 775 -R /var/

deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: htcondor-mini--all-in-one
  namespace: grafana-exporter
spec:
  # (selector and pod template metadata not shown)
  template:
    spec:
      containers:
      - image: <custom_image>
        imagePullPolicy: Always
        name: htcondor-mini--all-in-one
        resources: {}
        securityContext:
          capabilities: {}
        stdin: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        tty: true
      dnsConfig: {}
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30

Here is a brief summary of the commands and their output:

CMD | output
condor_status | CEDAR:6001:Failed to connect to <127.0.0.1:9618>
condor_master | ERROR "Cannot open log file '/var/log/condor/MasterLog'" at line 174 in file /var/lib/condor/execute/slot1/dir_17406/userdir/.tmpruBd6F/BUILD/condor-9.0.5/src/condor_utils/dprintf_setup.cpp
ls -ld /var/ | drwxrwxr-x 1 condor root 17 Nov 13 2020 /var/
ls -ld /var/log/ | drwxrwxr-x 1 condor root 65 Oct 7 11:54 /var/log/
ls -ld /var/log/condor | drwxrwxr-x 1 condor condor 240 Oct 7 11:23 /var/log/condor
ls -ld /var/log/condor/MasterLog | -rwxrwxr-x 1 condor condor 3243 Oct 7 11:23 /var/log/condor/MasterLog
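
The listing shows that the files belong to condor, so ownership alone does not explain the error. What also matters is which user the failing process runs as. The sketch below is a hypothetical diagnostic, not something from the original image: the function name `describe_access` is an assumption, and it presumes a Python interpreter is available wherever it is run. It puts the effective user ID of the process next to the owner, mode and writability of a path.

```python
import os
import stat
import tempfile


def describe_access(path):
    """Collect the facts that decide whether this process can write to path."""
    info = os.stat(path)
    return {
        "path": path,
        "process_euid": os.geteuid(),
        "owner_uid": info.st_uid,
        "owner_gid": info.st_gid,
        "mode": stat.filemode(info.st_mode),
        # os.access performs its check with the real uid/gid of the process.
        "writable": os.access(path, os.W_OK),
    }


if __name__ == "__main__":
    # Inside the pod the interesting path would be /var/log/condor/MasterLog;
    # a temporary file is used here so that the sketch runs anywhere.
    with tempfile.NamedTemporaryFile() as handle:
        print(describe_access(handle.name))
```

Running it inside the container as the same user as supervisord would show whether that user can write to the log file despite the permissions listed above.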

MasterLog content :

10/07/21 11:23:21 ******************************************************
10/07/21 11:23:21 ** condor_master (CONDOR_MASTER) STARTING UP
10/07/21 11:23:21 ** /usr/sbin/condor_master
10/07/21 11:23:21 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
10/07/21 11:23:21 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
10/07/21 11:23:21 ** $CondorVersion: 9.2.0 Sep 23 2021 BuildID: 557262 PackageID: 9.2.0-1 $
10/07/21 11:23:21 ** $CondorPlatform: x86_64_CentOS7 $
10/07/21 11:23:21 ** PID = 7
10/07/21 11:23:21 ** Log last touched time unavailable (No such file or directory)
10/07/21 11:23:21 ******************************************************
10/07/21 11:23:21 Using config source: /etc/condor/condor_config
10/07/21 11:23:21 Using local config sources: 
10/07/21 11:23:21    /etc/condor/config.d/00-htcondor-9.0.config
10/07/21 11:23:21    /etc/condor/config.d/00-minicondor
10/07/21 11:23:21    /etc/condor/config.d/01-misc.conf
10/07/21 11:23:21    /etc/condor/condor_config.local
10/07/21 11:23:21 config Macros = 73, Sorted = 73, StringBytes = 1848, TablesBytes = 2692
10/07/21 11:23:21 CLASSAD_CACHING is OFF
10/07/21 11:23:21 Daemon Log is logging: D_ALWAYS D_ERROR
10/07/21 11:23:21 SharedPortEndpoint: waiting for connections to named socket master_7_43af
10/07/21 11:23:21 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory
10/07/21 11:23:21 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
10/07/21 11:23:21 Permission denied error during DISCARD_SESSION_KEYRING_ON_STARTUP, continuing anyway
10/07/21 11:23:21 Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
10/07/21 11:23:21 SHARED_PORT is in front of a COLLECTOR, so it will use the configured collector port
10/07/21 11:23:21 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1632433213)
10/07/21 11:23:21 Cannot remove wait-for-startup file /var/lock/condor/shared_port_ad
10/07/21 11:23:21 WARNING: forward resolution of ip6-localhost doesn't match 127.0.0.1!
10/07/21 11:23:21 WARNING: forward resolution of ip6-loopback doesn't match 127.0.0.1!
10/07/21 11:23:22 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 9
10/07/21 11:23:22 Waiting for /var/lock/condor/shared_port_ad to appear.
10/07/21 11:23:22 Found /var/lock/condor/shared_port_ad.
10/07/21 11:23:22 Cannot remove wait-for-startup file /var/log/condor/.collector_address
10/07/21 11:23:23 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 10
10/07/21 11:23:23 Waiting for /var/log/condor/.collector_address to appear.
10/07/21 11:23:23 Found /var/log/condor/.collector_address.
10/07/21 11:23:23 Started DaemonCore process "/usr/sbin/condor_negotiator", pid and pgroup = 11
10/07/21 11:23:23 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 12
10/07/21 11:23:24 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 15
10/07/21 11:23:24 Daemons::StartAllDaemons all daemons were started

Many thanks for reading. I hope this helps other people.


Solution

  • Cause of the issue

    The issue is caused by the PSP (Pod Security Policy): by default, privilege escalation is not permitted for my condor user.

    SOLUTION

    The best solution I have found so far is to run everything as the condor user and give the required permissions to the condor user. To do so you need:

    Dockerfile

    FROM htcondor/mini:9.2-el7
    
    # SET WORKDIR
    WORKDIR /home/condor/
    RUN chown condor:condor /home/condor
    
    # COPY SUPERVISOR
    COPY supervisord.conf /etc/supervisord.conf
    
    # Run condor_master once so that it creates all the directories it needs
    RUN condor_master
    
    # FIX PERMISSION ISSUES FOR RANCHER
    RUN chown -R condor:condor /var/log/ /tmp &&\
     chown -R restd:restd /home/restd &&\
     chmod 755 -R /home/restd
    
    

    supervisord.conf:

    [supervisord]
    user=condor
    nodaemon=true
    logfile = /tmp/supervisord.log
    directory = /tmp
    pidfile = /tmp/supervisord.pid
    childlogdir = /tmp
    
    # the next 3 sections allow supervisorctl to manage the daemons
    [unix_http_server]
    file=/tmp/supervisord.sock
    chown=condor:condor
    chmod=0777
    user=condor
    
    [rpcinterface:supervisor]
    supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface
    
    [supervisorctl]
    serverurl=unix:///tmp/supervisord.sock
    
    [program:condor_master]
    user=condor
    command=/usr/sbin/condor_master -f
    autostart=true
    autorestart=true
    redirect_stderr=true
    stdout_logfile = /var/log/condor_master.log
    stderr_logfile = /var/log/condor_master.error.log
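
supervisorctl connects through the socket named in `serverurl`, so that path has to be the same as `file` in `[unix_http_server]`. The sketch below is a hypothetical sanity check, not part of the original answer: the function names and the sample path `/tmp/example.sock` are assumptions. It uses Python's standard `configparser` to compare the two settings.

```python
import configparser


def socket_paths(conf_text):
    """Return (server socket path, client socket path) from a supervisord config."""
    # Interpolation is disabled so that '%' characters are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    server_file = parser.get("unix_http_server", "file")
    client_url = parser.get("supervisorctl", "serverurl")
    prefix = "unix://"
    # Only unix:// URLs carry a socket path; anything else yields None.
    client_file = client_url[len(prefix):] if client_url.startswith(prefix) else None
    return server_file, client_file


def sockets_match(conf_text):
    """True when supervisorctl points at the socket that supervisord creates."""
    server_file, client_file = socket_paths(conf_text)
    return server_file == client_file


# Hypothetical sample configuration used only for illustration.
SAMPLE_CONF = """\
[unix_http_server]
file=/tmp/example.sock

[supervisorctl]
serverurl=unix:///tmp/example.sock
"""

if __name__ == "__main__":
    print(socket_paths(SAMPLE_CONF))
    print(sockets_match(SAMPLE_CONF))
```

A check like this could be run against the real supervisord.conf before building the image, since a mismatch only shows up later when supervisorctl fails to connect.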
    

    deployment.yaml

    apiVersion: apps/v1
    kind: Deployment
    spec:
      # (selector and pod template metadata not shown)
      template:
        spec:
          containers:
          - image: <condor-image>
            imagePullPolicy: Always
            name: htcondor-exporter
            ports:
            - containerPort: 8080
              name: myport
              protocol: TCP
            resources: {}
            securityContext:
              capabilities: {}
              runAsNonRoot: false
              runAsUser: 64
            stdin: true
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            tty: true