spring-boot google-cloud-platform google-cloud-build spring-cloud-gcp cloudbuild.yaml

Error with Spring Boot app in Google Cloud Build - Creates working version but reports failed build

I have a Java Spring Boot app that was previously building well, and we are now having issues.

We are on GCP and use Cloud Build triggers to start a build automatically when we push to certain branches. The goal is for the app to build itself and then deploy to App Engine. After much trial and error, earlier iterations of this setup did so successfully.

The app builds and deploys successfully: if I push code, it builds and the new version works. But Cloud Build keeps reporting that the build failed.

Our cloudbuild.yaml

steps:
- id: 'Stage app using mvn appengine plugin on mvn cloud build image'   
  name: 'gcr.io/cloud-builders/mvn'
  args: ['package', 'appengine:stage', '-Dapp.stage.appEngineDirectory=src/main/appengine/$_GAE_YAML', '-P cloud-gcp']
  timeout: 1600s
- id: "Deploy to app engine using gcloud image"
  name: 'gcr.io/cloud-builders/gcloud'
  args: ['app', 'deploy', 'target/appengine-staging/app.yaml',
         '-q', '$_GAE_PROMOTE', '-v', '$_GAE_VERSION']
  timeout: 1600s
- id: "Splitting Traffic"
  name: 'gcr.io/cloud-builders/gcloud'
  args: ['app', 'services', 'set-traffic', '--splits', '$_GAE_TRAFFIC']
timeout: 3200s

For reference here is an app.yaml

runtime: java
env: flex
runtime_config:
  jdk: openjdk8
env_variables:
  SPRING_PROFILES_ACTIVE: "dev"
handlers:
  - url: /.*
    script: this field is required, but ignored
    secure: always
manual_scaling:
  instances: 1
resources:
  cpu: 2
  memory_gb: 2
  disk_size_gb: 10
  volumes:
    - name: ramdisk1
      volume_type: tmpfs
      size_gb: 0.5

The first step completes just fine, or seemingly so.

The app becomes available on that specific version and runs just fine.

Here is the current "failure" we are facing, found in the output of the failed builds in the second step:

--------------------------------------------------------------------------------
Updating service [default] (this may take several minutes)...

ERROR: (gcloud.app.deploy) Error Response: [9] An internal error occurred while processing task /app-engine-flex/flex_await_healthy/flex_await_healthy>2021-11-04T14:55:50.087Z257173.in.0:
There was an error while pulling the application's docker image: the image does
not exist, one of the image layers is missing or the default service account
does not have  permission to pull the image. Please check if the image exists.
Also check if the default service account has the role Storage Object Viewer
(roles/storage.objectViewer) to pull images from Google Container
Registry or Artifact Registry Reader (roles/artifactregistry.reader) to pull
images from Artifact Registry. Refer to https://cloud.google.com/container-registry/docs/access-control
in granting access to pull images from GCR. Refer to https://cloud.google.com/artifact-registry/docs/access-control#roles
in granting access to pull images from Artifact Registry.

We have had fairly consistent issues with build caching, to the point where pushing new code has sometimes launched an old version of the code. I think it may all be related.

We tried clearing the entire Container Registry cache for the specific version of the app, and that is when this issue started occurring. I have a feeling it builds and launches one version of the app, then goes back and tries to launch a different version right on top of it. I am looking for a way to at least get more verbose logging, but this is mostly where I am stuck.

How do I go about adjusting the "name: 'gcr.io/cloud-builders/gcloud'" step to properly indicate that a deployment worked? Is that the right approach?

Solution

Answering my own question here.

It turns out that the application was deploying but listening on the wrong port. We just added server.port=8080 to the application.properties file and things started working again.
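For anyone looking for the exact change, it is a single line. A minimal sketch, assuming the file sits in the usual src/main/resources location (that path is an assumption; only the file name is given above):

```properties
# src/main/resources/application.properties (location assumed)
# App Engine flexible forwards requests to port 8080 inside the container,
# so the embedded server has to listen there.
server.port=8080
```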

I do believe what Chanseok Oh mentioned in the comment above on my question was also true, although changing the port seemed to be the one and only thing that solved this.

GCP was trying to do a readiness check and was getting nothing back. It is unclear whether this was related to the artifact cache at all.
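For context, the flexible environment lets you tune the readiness check in app.yaml. The fragment below is only a sketch and was not part of our actual config: the values are illustrative, and it assumes the app is listening on port 8080 so the check can reach it at all.

```yaml
# Hypothetical addition to app.yaml (not in the original config).
readiness_check:
  path: "/readiness_check"      # URL path the platform probes
  check_interval_sec: 5         # seconds between probes
  timeout_sec: 4                # seconds before a probe counts as failed
  failure_threshold: 2          # consecutive failures before the instance is unready
  success_threshold: 2          # consecutive successes before the instance is ready
  app_start_timeout_sec: 300    # how long the deployment waits for the app to become ready
```

If the app never answers on the expected port, the deployment step waits until it times out and then reports a failure, which would be consistent with the app running while the build is marked as failed.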