I have a multi-container pod with an nginx container and a php-fpm container. They communicate over a Unix domain socket.
Years ago I learned that each container reports its own readiness, and once both are ready the pod as a whole comes online and starts accepting traffic. I'm really not sure why that isn't working anymore.
Now it seems like, no matter what, the php container passes its readiness check but php-fpm doesn't actually appear to be online:
2025/04/08 04:29:25 [crit] 13#13: *293 connect() to unix:/sock/php.sock failed (2: No such file or directory) while connecting to upstream,
These errors persist for a couple of minutes after the pod starts. The relevant container specs:
nginx:
- name: nginx
  image: {{.Values.image.nginx.repository}}:{{.Values.image.tag}}
  imagePullPolicy: IfNotPresent
  lifecycle:
    preStop:
      exec:
        # Introduce a delay to the shutdown sequence to wait for the
        # pod eviction event to propagate. Then, gracefully shut down
        # nginx.
        command: ["/bin/sh", "-c", "sleep 10; /usr/sbin/nginx -s quit"]
  readinessProbe:
    httpGet:
      path: /online-status
      port: 8080
    initialDelaySeconds: 35
    timeoutSeconds: 4
    periodSeconds: 5
    successThreshold: 15
    failureThreshold: 2
  volumeMounts:
    - name: php-socket
      mountPath: /sock
php:
- name: php
  image: {{.Values.image.php.repository}}:{{.Values.image.tag}}
  imagePullPolicy: IfNotPresent
  securityContext:
    capabilities:
      add:
        - SYS_PTRACE
  lifecycle:
    preStop:
      exec:
        # Introduce a delay to the shutdown sequence to wait for the
        # pod eviction event to propagate. Then, gracefully shut down
        # php. The php-fpm children should handle open requests before
        # full shutdown (see process_control_timeout in start.sh).
        command: ["sh", "-c", "sleep 30; kill -QUIT 1"]
  readinessProbe:
    exec:
      command:
        - php
        - /var/www/health_check.php
    initialDelaySeconds: 15
    periodSeconds: 5
    timeoutSeconds: 3
  volumeMounts:
    - name: php-socket
      mountPath: /sock
Volume:
volumes:
  - name: php-socket
    emptyDir: {}
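One thing worth flagging about the spec above: the php readiness probe runs the PHP CLI, which says nothing about whether php-fpm has actually created its socket yet. A probe along these lines (an untested sketch; the socket and health-check paths are just the ones from my config) would at least tie readiness to the socket existing:

readinessProbe:
  exec:
    command:
      - sh
      - -c
      # Only report ready once php-fpm has created its Unix socket
      # and the health-check script passes.
      - test -S /sock/php.sock && php /var/www/health_check.php
  initialDelaySeconds: 15
  periodSeconds: 5
  timeoutSeconds: 3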
I've deliberately tried raising the nginx container's initialDelaySeconds, but that doesn't seem to affect the 500s at all. Am I missing something dumb? We recently moved to a new GKE cluster, but I doubt that would matter; the Kubernetes version is only slightly newer. Otherwise nothing major has changed on our end, and I think we would have noticed this problem at some point over the years.
Figured it out. I am using GKE, and because the cluster is VPC-native, the load balancer was using NEGs (network endpoint groups) to route traffic directly to the pods, completely bypassing the usual Kubernetes Service behavior I was used to.
I had to declare my own BackendConfig instead of relying on the auto-generated default, and of course point my Service at it.
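For anyone hitting the same thing, this is roughly the shape of it (a sketch, not my exact manifests: the BackendConfig/Service names and the selector are made up, but the /online-status path and port 8080 match the nginx readiness probe above):

apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: nginx-backendconfig
spec:
  healthCheck:
    # Health-check the pods directly on the nginx status endpoint,
    # since NEGs bypass the Service's own readiness gating.
    type: HTTP
    requestPath: /online-status
    port: 8080
    checkIntervalSec: 5
    timeoutSec: 4
    healthyThreshold: 1
    unhealthyThreshold: 2
---
apiVersion: v1
kind: Service
metadata:
  name: web
  annotations:
    # Use NEGs (container-native load balancing) for this Service.
    cloud.google.com/neg: '{"ingress": true}'
    # Apply the BackendConfig above to this Service's ports.
    cloud.google.com/backend-config: '{"default": "nginx-backendconfig"}'
spec:
  selector:
    app: web
  ports:
    - name: http
      port: 80
      targetPort: 8080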
This change was brutal to unpack. I'm surprised by how hard it was to find documentation on it, and AI wasn't much help either.