The issue is due to :
PSP
(Pod security policy
) By default escalation is not permit for my condor
user. That is why it is not working. because the supervisord
is running as root
user and try to write logs and start condor collector as root
and not as an other user (i.e condor
)
The mini-condor
base image is not starting as expected on kubernetes rancher pod.
I am using :
ps : the image was working perfectly on :
- a local env
- minikube default installation
I am running it as a simple deployment :
When the pod is starting, the Kubernetes default log file is displaying :
2021-09-15 09:26:36,908 INFO supervisord started with pid 1
2021-09-15 09:26:37,911 INFO spawned: 'condor_master' with pid 20
2021-09-15 09:26:37,912 INFO spawned: 'condor_restd' with pid 21
2021-09-15 09:26:37,917 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:37,924 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:38,926 INFO spawned: 'condor_master' with pid 22
2021-09-15 09:26:38,928 INFO spawned: 'condor_restd' with pid 23
2021-09-15 09:26:38,932 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:38,936 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:40,939 INFO spawned: 'condor_master' with pid 24
2021-09-15 09:26:40,943 INFO spawned: 'condor_restd' with pid 25
2021-09-15 09:26:40,947 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:40,948 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:43,953 INFO spawned: 'condor_master' with pid 26
2021-09-15 09:26:43,955 INFO spawned: 'condor_restd' with pid 27
2021-09-15 09:26:43,959 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:43,968 INFO gave up: condor_restd entered FATAL state, too many start retries too quickly
2021-09-15 09:26:43,969 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:44,970 INFO gave up: condor_master entered FATAL state, too many start retries too quickly
Here is a brief cmd and output result:
CMD | output |
---|---|
condor_status |
CEDAR:6001:Failed to connect to <127.0.0.1:9618> |
condor_master |
ERROR "Cannot open log file '/var/log/condor/MasterLog'" at line 174 in file /var/lib/condor/execute/slot1/dir_17406/userdir/.tmpruBd6F/BUILD/condor-9.0.5/src/condor_utils/dprintf_setup.cpp` |
I decided to customize the image, but the error is the same
The docker images used to try to fix the permission issue
FROM htcondor/mini:9.2-el7
RUN condor_master
RUN chown condor:root /var/
RUN chown condor:root /var/log
RUN chown -R condor:root /var/log/
RUN chown -R condor:condor /var/log/condor
RUN chown condor:condor /var/log/condor/ProcLog
RUN chown condor:condor /var/log/condor/MasterLog
RUN chmod 775 -R /var/
apiVersion: apps/v1
kind: Deployment
metadata:
name: htcondor-mini--all-in-one
namespace: grafana-exporter
spec:
containers:
- image: <custom_image>
imagePullPolicy: Always
name: htcondor-mini--all-in-one
resources: {}
securityContext:
capabilities: {}
stdin: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
tty: true
dnsConfig: {}
dnsPolicy: ClusterFirst
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
terminationGracePeriodSeconds: 30
Here is a brief cmd and output result:
CMD | output |
---|---|
condor_status |
CEDAR:6001:Failed to connect to <127.0.0.1:9618> |
condor_master |
ERROR "Cannot open log file '/var/log/condor/MasterLog'" at line 174 in file /var/lib/condor/execute/slot1/dir_17406/userdir/.tmpruBd6F/BUILD/condor-9.0.5/src/condor_utils/dprintf_setup.cpp` |
ls -ld /var/ |
drwxrwxr-x 1 condor root 17 Nov 13 2020 /var/ |
ls -ld /var/log/ |
drwxrwxr-x 1 condor root 65 Oct 7 11:54 /var/log/ |
ls -ld /var/log/condor |
drwxrwxr-x 1 condor condor 240 Oct 7 11:23 /var/log/condor |
ls -ld /var/log/condor/MasterLog |
-rwxrwxr-x 1 condor condor 3243 Oct 7 11:23 /var/log/condor/MasterLog |
MasterLog content :
10/07/21 11:23:21 ******************************************************
10/07/21 11:23:21 ** condor_master (CONDOR_MASTER) STARTING UP
10/07/21 11:23:21 ** /usr/sbin/condor_master
10/07/21 11:23:21 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
10/07/21 11:23:21 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
10/07/21 11:23:21 ** $CondorVersion: 9.2.0 Sep 23 2021 BuildID: 557262 PackageID: 9.2.0-1 $
10/07/21 11:23:21 ** $CondorPlatform: x86_64_CentOS7 $
10/07/21 11:23:21 ** PID = 7
10/07/21 11:23:21 ** Log last touched time unavailable (No such file or directory)
10/07/21 11:23:21 ******************************************************
10/07/21 11:23:21 Using config source: /etc/condor/condor_config
10/07/21 11:23:21 Using local config sources:
10/07/21 11:23:21 /etc/condor/config.d/00-htcondor-9.0.config
10/07/21 11:23:21 /etc/condor/config.d/00-minicondor
10/07/21 11:23:21 /etc/condor/config.d/01-misc.conf
10/07/21 11:23:21 /etc/condor/condor_config.local
10/07/21 11:23:21 config Macros = 73, Sorted = 73, StringBytes = 1848, TablesBytes = 2692
10/07/21 11:23:21 CLASSAD_CACHING is OFF
10/07/21 11:23:21 Daemon Log is logging: D_ALWAYS D_ERROR
10/07/21 11:23:21 SharedPortEndpoint: waiting for connections to named socket master_7_43af
10/07/21 11:23:21 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory
10/07/21 11:23:21 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
10/07/21 11:23:21 Permission denied error during DISCARD_SESSION_KEYRING_ON_STARTUP, continuing anyway
10/07/21 11:23:21 Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
10/07/21 11:23:21 SHARED_PORT is in front of a COLLECTOR, so it will use the configured collector port
10/07/21 11:23:21 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1632433213)
10/07/21 11:23:21 Cannot remove wait-for-startup file /var/lock/condor/shared_port_ad
10/07/21 11:23:21 WARNING: forward resolution of ip6-localhost doesn't match 127.0.0.1!
10/07/21 11:23:21 WARNING: forward resolution of ip6-loopback doesn't match 127.0.0.1!
10/07/21 11:23:22 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 9
10/07/21 11:23:22 Waiting for /var/lock/condor/shared_port_ad to appear.
10/07/21 11:23:22 Found /var/lock/condor/shared_port_ad.
10/07/21 11:23:22 Cannot remove wait-for-startup file /var/log/condor/.collector_address
10/07/21 11:23:23 Started DaemonCore process "/usr/sbin/condor_collector", pid and pgroup = 10
10/07/21 11:23:23 Waiting for /var/log/condor/.collector_address to appear.
10/07/21 11:23:23 Found /var/log/condor/.collector_address.
10/07/21 11:23:23 Started DaemonCore process "/usr/sbin/condor_negotiator", pid and pgroup = 11
10/07/21 11:23:23 Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 12
10/07/21 11:23:24 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 15
10/07/21 11:23:24 Daemons::StartAllDaemons all daemons were started
A huge thanks for reading. Hope it will help many other people.
The issue is due to :
PSP policy
(Pod security policy)
By default escalation is not permit for my condor user.
THE BEST SOLUTION I found at the moment is to run EVERYTHING as condor user and give the permisssion to the condor users. To do so you need :
supervisord.conf
: Run supervisor as condor
usersupervisord.conf
: run log and socket in /tmp
Dockerfile
: Change the owner of most of folder by condor
deployment.yaml
set the ID
to 64
(condor user)Dockerfile
FROM htcondor/mini:9.2-el7
# SET WORKDIR
WORKDIR /home/condor/
RUN chown condor:condor /home/condor
# COPY SUPERVISOR
COPY supervisord.conf /etc/supervisord.conf
# Need to run the cmd to create all dir
RUN condor_master
# FIX PERMISSION ISSUES FOR RANCHER
RUN chown -R condor:condor /var/log/ /tmp &&\
chown -R restd:restd /home/restd &&\
chmod 755 -R /home/restd
supervisord.conf
:
[supervisord]
user=condor
nodaemon=true
logfile = /tmp/supervisord.log
directory = /tmp
pidfile = /tmp/supervisord.pid
childlogdir = /tmp
# next 3 sections contain using supervisorctl to manage daemons
[unix_http_server]
file=/tmp/supervisord.sock
chown=condor:condor
chmod=0777
user=condor
[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface
[supervisorctl]
serverurl=unix:///tmp/supervisor.sock
[program:condor_master]
user=condor
command=/usr/sbin/condor_master -f
autostart=true
autorestart=true
redirect_stderr=true
stdout_logfile = /var/log/condor_master.log
stderr_logfile = /var/log/condor_master.error.log
deployment.yaml
apiVersion: apps/v1
kind: Deployment
spec:
containers:
- image: <condor-image>
imagePullPolicy: Always
name: htcondor-exporter
ports:
- containerPort: 8080
name: myport
protocol: TCP
resources: {}
securityContext:
capabilities: {}
runAsNonRoot: false
runAsUser: 64
stdin: true
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
tty: true