I can't create pipelines. I can't even load the samples / tutorials on the AI Platform Pipelines Dashboard because it doesn't seem to be able to proxy to whatever it needs to.
An error occurred
Error occured while trying to proxy to: ...
I looked into the cluster's details and found 3 components with errors:
Deployment metadata-grpc-deployment Does not have minimum availability
Deployment ml-pipeline Does not have minimum availability
Deployment ml-pipeline-persistenceagent Does not have minimum availability
Creating the clusters involve approx. 3 clicks in GCP Kubernetes Engine so I don't think I messed up this step.
Anyone have an idea of how to achieve "minimum availability"?
UPDATE 1
Nodes have adequate resources and are Ready. YAML file looks good. I have 2 clusters in diff regions/zones and both have the deployment errors listed above. 2 Pods are not ok.
Name: ml-pipeline-65479485c8-mcj9x
Namespace: default
Priority: 0
Node: gke-cluster-3-default-pool-007784cb-qcsn/10.150.0.2
Start Time: Thu, 17 Sep 2020 22:15:19 +0000
Labels: app=ml-pipeline
app.kubernetes.io/name=kubeflow-pipelines-3
pod-template-hash=65479485c8
Annotations: kubernetes.io/limit-ranger: LimitRanger plugin set: cpu request for container ml-pipeline-api-server
Status: Running
IP: 10.4.0.8
IPs:
IP: 10.4.0.8
Controlled By: ReplicaSet/ml-pipeline-65479485c8
Containers:
ml-pipeline-api-server:
Container ID: ...
Image: ...
Image ID: ...
Ports: 8888/TCP, 8887/TCP
Host Ports: 0/TCP, 0/TCP
State: Running
Started: Fri, 18 Sep 2020 10:27:31 +0000
Last State: Terminated
Reason: Error
Exit Code: 255
Started: Fri, 18 Sep 2020 10:20:38 +0000
Finished: Fri, 18 Sep 2020 10:27:31 +0000
Ready: False
Restart Count: 98
Requests:
cpu: 100m
Liveness: exec [wget -q -S -O - http://localhost:8888/apis/v1beta1/healthz] delay=3s timeout=2s period=5s #success=1 #failure=3
Readiness: exec [wget -q -S -O - http://localhost:8888/apis/v1beta1/healthz] delay=3s timeout=2s period=5s #success=1 #failure=3
Environment:
HAS_DEFAULT_BUCKET: true
BUCKET_NAME:
PROJECT_ID: <set to the key 'project_id' of config map 'gcp-default-config'> Optional: false
POD_NAMESPACE: default (v1:metadata.namespace)
DEFAULTPIPELINERUNNERSERVICEACCOUNT: pipeline-runner
OBJECTSTORECONFIG_SECURE: false
OBJECTSTORECONFIG_BUCKETNAME:
DBCONFIG_DBNAME: kubeflow_pipelines_3_pipeline
DBCONFIG_USER: <set to the key 'username' in secret 'mysql-credential'> Optional: false
DBCONFIG_PASSWORD: <set to the key 'password' in secret 'mysql-credential'> Optional: false
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from ml-pipeline-token-77xl8 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
ml-pipeline-token-77xl8:
Type: Secret (a volume populated by a Secret)
SecretName: ml-pipeline-token-77xl8
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 52m (x409 over 11h) kubelet, gke-cluster-3-default-pool-007784cb-qcsn Back-off restarting failed container
Warning Unhealthy 31m (x94 over 12h) kubelet, gke-cluster-3-default-pool-007784cb-qcsn Readiness probe failed:
Warning Unhealthy 31m (x29 over 10h) kubelet, gke-cluster-3-default-pool-007784cb-qcsn (combined from similar events): Readiness probe failed: c
annot exec in a stopped state: unknown
Warning Unhealthy 17m (x95 over 12h) kubelet, gke-cluster-3-default-pool-007784cb-qcsn Liveness probe failed:
Normal Pulled 7m26s (x97 over 12h) kubelet, gke-cluster-3-default-pool-007784cb-qcsn Container image "gcr.io/cloud-marketplace/google-cloud-ai
-platform/kubeflow-pipelines/apiserver:1.0.0" already present on machine
Warning Unhealthy 75s (x78 over 12h) kubelet, gke-cluster-3-default-pool-007784cb-qcsn Liveness probe errored: rpc error: code = DeadlineExceede
d desc = context deadline exceeded
And the other pod:
Name: ml-pipeline-persistenceagent-67db8b8964-mlbmv
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 32s (x2238 over 12h) kubelet, gke-cluster-3-default-pool-007784cb-qcsn Back-off restarting failed container
SOLUTION
Do not let google handle any storage. Uncheck "Use managed storage" and set up your own artifact collections manually. You don't actually need to enter anything in these fields since the pipeline will be launched anyway.
The Does not have minimum availability
error is generic. There could be many issues that trigger it. You need to analyse more in-depth in order to find the actual problem. Here are some possible causes:
Insufficient resources: check if your Node has adequate resources (CPU/Memory). If Node is ok than check the Pod's status.
Liveliness probe and/or Readiness probe failure: execute kubectl describe pod <pod-name>
to check if they failed and why.
Deployment misconfiguration: review your deployment yaml file to see if there are any errors or leftovers from previous configurations.
You can also try to wait a bit as sometimes it takes some time in order to deploy everything and/or try changing your Region/Zone.