I have a GKE cluster which has suddenly stopped being able to pull my Docker images from GCR; both are in the same GCP project. It has been working well for several months with no issues pulling images, and has now started throwing errors even though I haven't made any changes.
(NB: I'm generally the only one on my team who accesses Google Cloud, though it's entirely possible that someone else on my team made changes, or inadvertently made changes without realising.)
I've seen a few other posts on this topic, but the solutions offered in them haven't helped. Two of these posts stood out to me in particular, as they were both posted around the same day my issues started, ~13/14 days ago. Whether that's a coincidence or not, who knows.
This post has the same issue as me; I'm unsure whether the posted comments helped them resolve it, but they haven't fixed it for me. This post seemed to be the same issue as well, but the poster says it resolved itself after waiting some time.
I first noticed the issue on the cluster a few days ago. I went to deploy a new image by pushing the image to GCR and then bouncing the pods with kubectl rollout restart deployment.
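(Roughly what that deploy looks like - the image and deployment names here are placeholders rather than my real ones:)
# Build and push the new image to GCR, then restart the pods
docker build -t gcr.io/<GCP_PROJECT>/XXX:dev-latest .
docker push gcr.io/<GCP_PROJECT>/XXX:dev-latest
kubectl rollout restart deployment <DEPLOYMENT_NAME>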
The pods all then came back with ImagePullBackOff, saying that they couldn't get the image from GCR:
kubectl get pods:
XXX-XXX-XXX 0/1 ImagePullBackOff 0 13d
XXX-XXX-XXX 0/1 ImagePullBackOff 0 13d
XXX-XXX-XXX 0/1 ImagePullBackOff 0 13d
...
kubectl describe pod XXX-XXX-XXX:
Normal BackOff 20s kubelet Back-off pulling image "gcr.io/<GCP_PROJECT>/XXX:dev-latest"
Warning Failed 20s kubelet Error: ImagePullBackOff
Normal Pulling 8s (x2 over 21s) kubelet Pulling image "gcr.io/<GCP_PROJECT>/XXX:dev-latest"
Warning Failed 7s (x2 over 20s) kubelet Failed to pull image "gcr.io/<GCP_PROJECT>/XXX:dev-latest": rpc error: code = Unknown desc = failed to pull and unpack image "gcr.io/<GCP_PROJECT>/XXX:dev-latest": failed to resolve reference "gcr.io/<GCR_PROJECT>/XXX:dev-latest": unexpected status code [manifests dev-latest]: 403 Forbidden
Warning Failed 7s (x2 over 20s) kubelet Error: ErrImagePull
I know that the image definitely exists in GCR - I've SSH'd into one of the cluster nodes and tried to docker pull manually, with no success:
docker pull gcr.io/<GCP_PROJECT>/XXX:dev-latest
Error response from daemon: unauthorized: You don't have the needed permissions to perform this operation, and you may have invalid credentials. To authenticate your request, follow the steps in: https://cloud.google.com/container-registry/docs/advanced-authentication
(I also did a docker pull of a public mongodb image to confirm that pulling in general was working, so the problem is specific to GCR.)
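(In case it helps anyone else debugging this, here's a rough way to check from the node whether its service account token can authenticate to GCR at all. This assumes a standard GKE node with metadata server access; the sed token extraction is just a quick hack:)
# Check which OAuth scopes the node's service account has
# (pulling from GCR needs at least devstorage.read_only)
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/scopes"
# Grab an access token from the metadata server and try the pull with it
TOKEN=$(curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token" \
  | sed -E 's/.*"access_token":"([^"]+)".*/\1/')
docker login -u oauth2accesstoken -p "$TOKEN" https://gcr.io
docker pull gcr.io/<GCP_PROJECT>/XXX:dev-latest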
So this leads me to believe it's an issue with the service account not having the correct permissions, as in the cloud docs under the 'Error 400/403' section. This seems to suggest that the service account has either been deleted, or edited manually.
During my troubleshooting, I tried to find out exactly which service account GKE was using to pull from GCR. The steps outlined in the docs say: 'The name of your Google Kubernetes Engine service account is as follows, where PROJECT_NUMBER is your project number:'
service-PROJECT_NUMBER@container-engine-robot.iam.gserviceaccount.com
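(To find the project number and construct that name - the project ID is a placeholder:)
# Look up the project number for the project
PROJECT_NUMBER=$(gcloud projects describe <GCP_PROJECT> --format='value(projectNumber)')
# The GKE service agent should then be:
echo "service-${PROJECT_NUMBER}@container-engine-robot.iam.gserviceaccount.com"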
I found the service account and checked its policies - it did have one for roles/container.serviceAgent, but nothing specifically mentioning Kubernetes as I would expect from the description in the docs: 'the Kubernetes Engine Service Agent role' (unless that is the one they're describing, in which case I'm no better off than before anyway...).
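(For reference, this is roughly how you can check which members hold that role - purely illustrative:)
# List the members bound to the Kubernetes Engine Service Agent role
gcloud projects get-iam-policy <GCP_PROJECT> \
  --flatten="bindings[].members" \
  --filter="bindings.role:roles/container.serviceAgent" \
  --format="table(bindings.role, bindings.members)"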
Figuring it must not have had the correct roles, I then followed the steps to re-enable it (disable then re-enable the Kubernetes Engine API). Running gcloud projects get-iam-policy <GCP_PROJECT> again and diffing the two outputs (before/after), the only difference was that a service account for '@cloud-filer...' had been deleted.
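(The diff itself was just along these lines - file names are arbitrary:)
# Snapshot the IAM policy before and after re-enabling the API, then compare
gcloud projects get-iam-policy <GCP_PROJECT> --format=json > iam-before.json
# ...disable and re-enable the Kubernetes Engine API...
gcloud projects get-iam-policy <GCP_PROJECT> --format=json > iam-after.json
diff iam-before.json iam-after.json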
Thinking maybe the error was something else, I tried spinning up a new cluster. Same error - it can't pull images.
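(For completeness, the new cluster was just a vanilla one along these lines - the name and zone are placeholders:)
# Create a small throwaway cluster with the default node service account and scopes
gcloud container clusters create test-cluster --zone europe-west2-a --num-nodes 1
gcloud container clusters get-credentials test-cluster --zone europe-west2-a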
I've been racking my brains to try to troubleshoot, but I'm now out of ideas! Any and all help much appreciated!
Have now solved this.
The service account had the correct roles/permissions, but for whatever reason stopped working.
I manually created a key for that service account, added it into the cluster as an image pull secret, and set the Kubernetes service account to use that secret.
I'm still at a loss as to why it wasn't already doing this, or why it suddenly stopped working in the first place, but it's working...
Fix was from this guide, from the section starting 'Create & use GCR credentials'.
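For anyone else who lands here, the steps boil down to roughly the following (the secret name, key file and email are placeholders, and <SA_EMAIL> is whichever service account you want the cluster to pull with):
# 1. Create a JSON key for the service account that should be pulling from GCR
gcloud iam service-accounts keys create gcr-key.json --iam-account=<SA_EMAIL>
# 2. Add the key to the cluster as a docker-registry pull secret
kubectl create secret docker-registry gcr-json-key \
  --docker-server=gcr.io \
  --docker-username=_json_key \
  --docker-password="$(cat gcr-key.json)" \
  --docker-email=any@example.com
# 3. Point the default Kubernetes service account at that secret
kubectl patch serviceaccount default \
  -p '{"imagePullSecrets": [{"name": "gcr-json-key"}]}'
# 4. Restart the deployment so new pods pick up the pull secret
kubectl rollout restart deployment <DEPLOYMENT_NAME>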