I have a multi-container pod with an nginx container and a php-fpm container. They communicate over a Unix domain socket.
Years ago I learned that each container reports its own readiness, and once both are ready the pod as a whole comes online and starts accepting traffic. I'm really not sure why that isn't working anymore.
Now it seems like, no matter what, the php container passes its readiness check but php-fpm doesn't actually appear to be online:
2025/04/08 04:29:25 [crit] 13#13: *293 connect() to unix:/sock/php.sock failed (2: No such file or directory) while connecting to upstream,
These errors persist for a couple of minutes after the pod starts. The relevant container specs:
nginx:
- name: nginx
  image: {{.Values.image.nginx.repository}}:{{.Values.image.tag}}
  imagePullPolicy: IfNotPresent
  lifecycle:
    preStop:
      exec:
        # Introduce a delay to the shutdown sequence to wait for the
        # pod eviction event to propagate. Then, gracefully shut down
        # nginx.
        command: ["/bin/sh", "-c", "sleep 10; /usr/sbin/nginx -s quit"]
  readinessProbe:
    httpGet:
      path: /online-status
      port: 8080
    initialDelaySeconds: 35
    timeoutSeconds: 4
    periodSeconds: 5
    successThreshold: 15
    failureThreshold: 2
  volumeMounts:
    - name: php-socket
      mountPath: /sock
php:
- name: php
  image: {{.Values.image.php.repository}}:{{.Values.image.tag}}
  imagePullPolicy: IfNotPresent
  securityContext:
    capabilities:
      add:
        - SYS_PTRACE
  lifecycle:
    preStop:
      exec:
        # Introduce a delay to the shutdown sequence to wait for the
        # pod eviction event to propagate. Then, gracefully shut down
        # php. The php-fpm children should handle open requests before
        # full shutdown (see process_control_timeout in start.sh).
        command: ["sh", "-c", "sleep 30; kill -QUIT 1"]
  readinessProbe:
    exec:
      command:
        - php
        - /var/www/health_check.php
    initialDelaySeconds: 15
    periodSeconds: 5
    timeoutSeconds: 3
  volumeMounts:
    - name: php-socket
      mountPath: /sock
Volume:
volumes:
  - name: php-socket
    emptyDir: {}
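One thing worth flagging about the spec above: the php readiness probe runs the PHP CLI, which says nothing about whether php-fpm has actually created its socket yet. A probe along these lines (an untested sketch; the socket and health-check paths are just the ones from my config) would at least tie readiness to the socket existing:

readinessProbe:
  exec:
    command:
      - sh
      - -c
      # Only report ready once php-fpm has created its Unix socket
      # and the health-check script passes.
      - test -S /sock/php.sock && php /var/www/health_check.php
  initialDelaySeconds: 15
  periodSeconds: 5
  timeoutSeconds: 3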
I've deliberately tried raising the nginx container's initialDelaySeconds, but that doesn't seem to affect the 500s at all. Am I missing something dumb? We recently moved to a new GKE cluster, but I doubt that would matter; the Kubernetes version is only slightly newer. Otherwise nothing major has changed on our end, and I think we would have noticed this problem at some point over the years.
Figured it out. I am using GKE, and because the cluster is VPC-native, the load balancer was using NEGs (network endpoint groups) to route traffic directly to the pods, completely bypassing the usual Kubernetes Service behavior I was used to.
I had to declare my own BackendConfig instead of relying on the auto-generated default, and of course point my Service at it.
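For anyone hitting the same thing, this is roughly the shape of it (a sketch, not my exact manifests: the BackendConfig/Service names and the selector are made up, but the /online-status path and port 8080 match the nginx readiness probe above):

apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: nginx-backendconfig
spec:
  healthCheck:
    # Health-check the pods directly on the nginx status endpoint,
    # since NEGs bypass the Service's own readiness gating.
    type: HTTP
    requestPath: /online-status
    port: 8080
    checkIntervalSec: 5
    timeoutSec: 4
    healthyThreshold: 1
    unhealthyThreshold: 2
---
apiVersion: v1
kind: Service
metadata:
  name: web
  annotations:
    # Use NEGs (container-native load balancing) for this Service.
    cloud.google.com/neg: '{"ingress": true}'
    # Apply the BackendConfig above to this Service's ports.
    cloud.google.com/backend-config: '{"default": "nginx-backendconfig"}'
spec:
  selector:
    app: web
  ports:
    - name: http
      port: 80
      targetPort: 8080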
This change was brutal to unpack. I'm surprised by how hard it was to find documentation on it, and AI wasn't much help either.