Tags: python, django, docker, google-app-engine, google-cloud-platform

Google App Engine deployment fails because of failing readiness check


A custom App Engine (flexible environment) app fails to start up, and it seems to be due to failing health checks. The app has a few custom dependencies (e.g. PostGIS, GDAL), hence a few extra layers on top of the App Engine base image. It builds successfully and runs locally in a Docker container.

ERROR: (gcloud.app.deploy) Error Response: [4] Your deployment has failed to become healthy in the allotted time and therefore was rolled back. If you believe this was an error, try adjusting the 'app_start_timeout_sec' setting in the 'readiness_check' section.

The Dockerfile looks as follows (note: no CMD, as the entrypoint is defined in docker-compose.yml and app.yaml):

FROM gcr.io/google-appengine/python
ENV PYTHONUNBUFFERED 1
ENV DEBIAN_FRONTEND noninteractive

# Pull GDAL from the ubuntugis PPA, then clear apt caches to keep the layer lean
RUN apt-get -y update && apt-get -y upgrade \
    && apt-get install -y software-properties-common \
    && add-apt-repository -y ppa:ubuntugis/ppa \
    && apt-get -y update \
    && apt-get -y install gdal-bin libgdal-dev python3-gdal \
    && apt-get autoremove -y \
    && apt-get autoclean -y \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies first so this layer is cached across code changes
ADD requirements.txt /app/requirements.txt
RUN python3 -m pip install -r /app/requirements.txt
ADD . /app/
WORKDIR /app
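As a quick sanity check that the GDAL layers actually work inside the built image, something like this should print a GDAL version string (the image tag is just illustrative):

docker build -t gisapplication .
docker run --rm gisapplication python3 -c "from osgeo import gdal; print(gdal.VersionInfo())"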

This unfortunately creates an image of a whopping 1.58 GB, but the original gcr.io Python image already starts at 1.05 GB, so I don't think the size of the image would or should be the problem.
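(The sizes come from docker images; if anyone wants to check where the extra half-gigabyte goes, docker history breaks an image down per layer - same illustrative tag as above:)

docker images gcr.io/google-appengine/python
docker history gisapplication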

Running this locally with the following docker-compose.yml config beautifully spins up a container in no time:

version: "3"
services:
  web:
    build: .
    command: gunicorn gisapplication.wsgi --bind 0.0.0.0:8080
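For local testing, something like the following works, assuming a ports mapping (e.g. "8080:8080") is added to the service - the config above doesn't include one:

docker-compose up --build
curl -i http://localhost:8080/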

So, I would have expected the following app.yaml would do the trick:

runtime: custom
env: flex
entrypoint: gunicorn -b :$PORT gisapplication.wsgi

beta_settings:
    cloud_sql_instances: <sql-db-connection>

runtime_config:
    python_version: 3
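The deploy itself was the standard command, run from the directory containing app.yaml and the Dockerfile; adding --verbosity=debug makes gcloud print a lot more build and health-check detail, which helps when chasing a failure like this:

gcloud app deploy --verbosity=debug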

No luck. So, as per the error above, it seemed to have something to do with the readiness check, and I tried increasing the timeout for the app to start (15 minutes!). There have been issues with these health checks in the past, but rolling back to legacy health checks is no longer an option as of September 2019:

readiness_check:
    path: "/readiness_check"
    check_interval_sec: 10
    timeout_sec: 10
    failure_threshold: 3
    success_threshold: 3
    app_start_timeout_sec: 900

liveness_check:
    path: "/liveness_check"
    check_interval_sec: 60
    timeout_sec: 4
    failure_threshold: 3
    success_threshold: 2
    initial_delay_sec: 30
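For reference, my reading of how the readiness settings interact (per the flex health-check documentation):

# After an instance starts, /readiness_check is polled every
# check_interval_sec (10s); the instance must return 200 on
# success_threshold (3) consecutive checks before it gets traffic,
# and the whole deployment is rolled back if no instance manages
# that within app_start_timeout_sec (900s).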

Split health checks are definitely on. The output from gcloud beta app describe is:

authDomain: gmail.com
codeBucket: staging.proj-id-000000.appspot.com
databaseType: CLOUD_DATASTORE_COMPATIBILITY
defaultBucket: proj-id-000000.appspot.com
defaultHostname: proj-id-000000.ts.r.appspot.com
featureSettings:
  splitHealthChecks: true
  useContainerOptimizedOs: true
gcrDomain: asia.gcr.io
id: proj-id-000000
locationId: australia-southeast1
name: apps/proj-id-000000
servingStatus: SERVING

That didn't work, so I also tried increasing the resources available to the instance, allocating the maximum amount of memory for 1 CPU (6.1 GB):

resources:
    cpu: 1
    memory_gb: 6.1
    disk_size_gb: 10

Just to be on the safe side, I added health check endpoints to the app for both the legacy and the split health checks - it's a Django app, so this went into the project's urls.py:

from django.http import HttpResponse

# No trailing slashes: the checkers request /readiness_check etc. verbatim
path('_ah/health', lambda r: HttpResponse("OK", status=200)),
path('readiness_check', lambda r: HttpResponse("OK", status=200)),
path('liveness_check', lambda r: HttpResponse("OK", status=200)),
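A quick way to verify these locally would be a small test along these lines (the test class and its placement are my own, not part of the app):

from django.test import SimpleTestCase

class HealthCheckTests(SimpleTestCase):
    def test_health_endpoints_return_200(self):
        # Paths must match the configured check paths exactly
        for url in ("/_ah/health", "/readiness_check", "/liveness_check"):
            self.assertEqual(self.client.get(url).status_code, 200)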

So, when I dive into the logs, there is a successful request to /liveness_check from a curl user agent, but the subsequent requests to /readiness_check from the GoogleHC user agent return a 503 (Service Unavailable).

[log viewer screenshot]

Shortly after (after 8 failed requests - why 8?), a shutdown trigger seems to be sent off:

2020-07-05 09:00:02.603 AEST Triggering app shutdown handlers.

Any ideas what is going on here? I think I've pretty much exhausted the options for fixing this problem and wonder whether the time wouldn't have been better invested in getting things up and running on Compute Engine/EC2.

ADDENDUM:

In addition to the SO issue linked above, I've also gone through related issues on Google's issue tracker (here and here).


Solution

  • All right, the Google guys could not help fix it either, but after an epic journey through way too many logs I managed to figure out what the issue was: the Dockerfile needs a CMD statement. While I had assumed that this is what the entrypoint in app.yaml was for, it seems App Engine spins up the container with docker run, so simply adding this line to the Dockerfile fixes it:

    CMD gunicorn -b :$PORT gisapplication.wsgi
    

    I also reverted to the default health check settings and was able to take the health check URL paths out of my app entirely, letting the default nginx instance shipped with the Google base container handle them.
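    One detail worth pointing out: the shell form of CMD matters here, because /bin/sh expands $PORT when the container starts. The exec (JSON array) form bypasses the shell, so gunicorn would receive the literal string ":$PORT" instead:

        # shell form - $PORT is expanded by /bin/sh at container start
        CMD gunicorn -b :$PORT gisapplication.wsgi

        # exec form - would NOT expand $PORT
        # CMD ["gunicorn", "-b", ":$PORT", "gisapplication.wsgi"]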