I have a local Kubernetes cluster created by Rancher Desktop, and I am trying to deploy the Kubeflow Training Operator based on the installation guide.
However, after deploying with
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.6.0"
the training-operator pod is stuck in a CrashLoopBackOff state, with the following log:
➜ kubectl logs training-operator-xxx -n kubeflow
I0714 04:54:03.434723 1 request.go:682] Waited for 1.024840626s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/packages.operators.coreos.com/v1?timeout=32s
1.689310446978421e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": ":8080"}
I0714 04:54:14.225698 1 request.go:682] Waited for 1.047503167s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/node.k8s.io/v1?timeout=32s
I0714 04:54:24.275500 1 request.go:682] Waited for 1.948469293s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/artifact.apicur.io/v1alpha1?timeout=32s
I0714 04:54:34.325909 1 request.go:682] Waited for 2.849523377s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/operators.coreos.com/v1?timeout=32s
I0714 04:54:44.724674 1 request.go:682] Waited for 1.047644251s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/operators.coreos.com/v1?timeout=32s
I0714 04:54:54.774273 1 request.go:682] Waited for 1.947402376s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/elasticsearch.k8s.elastic.co/v1?timeout=32s
Any ideas? Thanks!
It turns out the Kubeflow Training Operator pod requires additional startup time on its first initialization, so we can patch the Deployment with a startupProbe that has a higher failureThreshold. With the default periodSeconds of 10, a failureThreshold of 30 gives the container up to 300 seconds to become healthy, and the kubelet holds off the liveness and readiness checks until the startup probe succeeds.
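To confirm that probe failures are what is restarting the container, the pod's events should show them before each restart (pod name truncated as in the question; substitute your actual pod name):
kubectl describe pod training-operator-xxx -n kubeflow
kubectl get events -n kubeflow --sort-by=.lastTimestamp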
Here is the working version:
kubectl apply --kustomize=kubeflow-training-operator
kubeflow-training-operator/kustomization.yaml
resources:
- github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.6.0
patches:
- path: training-operator-deployment-patch.yaml
kubeflow-training-operator/training-operator-deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: training-operator
spec:
  template:
    spec:
      containers:
        - name: training-operator
          startupProbe:
            httpGet:
              path: /healthz
              port: 8081
            failureThreshold: 30
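After applying the kustomization, the rollout can be checked before tailing the logs (deployment and namespace names come from the standalone overlay above):
kubectl -n kubeflow rollout status deployment/training-operator
kubectl -n kubeflow get pods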
Then I can see it got deployed properly:
➜ kubectl logs training-operator-xxx -n kubeflow
I0714 05:41:03.499300 1 request.go:682] Waited for 1.041055501s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s
1.6893132670509896e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": ":8080"}
I0714 05:41:14.295944 1 request.go:682] Waited for 1.048299708s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/registry.apicur.io/v1?timeout=32s
I0714 05:41:24.296220 1 request.go:682] Waited for 1.898473292s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/admissionregistration.k8s.io/v1?timeout=32s
I0714 05:41:34.345574 1 request.go:682] Waited for 2.797829418s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/kafka.strimzi.io/v1beta1?timeout=32s
I0714 05:41:44.795474 1 request.go:682] Waited for 1.048790793s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/operators.coreos.com/v1alpha2?timeout=32s
I0714 05:41:54.845438 1 request.go:682] Waited for 1.945571376s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/monitoring.coreos.com/v1alpha1?timeout=32s
I0714 05:42:04.846740 1 request.go:682] Waited for 2.798720251s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/apm.k8s.elastic.co/v1beta1?timeout=32s
I0714 05:42:15.295114 1 request.go:682] Waited for 1.047853292s due to client-side throttling, not priority and fairness, request: GET:https://10.43.0.1:443/apis/node.k8s.io/v1?timeout=32s
1.689313340247459e+09 INFO setup starting manager
1.6893133402522147e+09 INFO Starting server {"kind": "health probe", "addr": "[::]:8081"}
1.6893133402523057e+09 INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
1.6893133402535763e+09 INFO Starting EventSource {"controller": "paddlejob-controller", "source": "kind source: *v1.PaddleJob"}
1.6893133402535298e+09 INFO Starting EventSource {"controller": "mxjob-controller", "source": "kind source: *v1.MXJob"}
1.6893133402539213e+09 INFO Starting EventSource {"controller": "paddlejob-controller", "source": "kind source: *v1.Pod"}
1.6893133402539287e+09 INFO Starting EventSource {"controller": "mxjob-controller", "source": "kind source: *v1.Pod"}
1.6893133402534788e+09 INFO Starting EventSource {"controller": "tfjob-controller", "source": "kind source: *v1.TFJob"}
1.689313340253958e+09 INFO Starting EventSource {"controller": "tfjob-controller", "source": "kind source: *v1.Pod"}
1.6893133402535348e+09 INFO Starting EventSource {"controller": "pytorchjob-controller", "source": "kind source: *v1.PyTorchJob"}
1.6893133402540202e+09 INFO Starting EventSource {"controller": "pytorchjob-controller", "source": "kind source: *v1.Pod"}
1.6893133402534842e+09 INFO Starting EventSource {"controller": "xgboostjob-controller", "source": "kind source: *v1.XGBoostJob"}
1.6893133402540386e+09 INFO Starting EventSource {"controller": "xgboostjob-controller", "source": "kind source: *v1.Pod"}
1.6893133402540817e+09 INFO Starting EventSource {"controller": "pytorchjob-controller", "source": "kind source: *v1.Service"}
1.6893133402540936e+09 INFO Starting Controller {"controller": "pytorchjob-controller"}
1.6893133402541058e+09 INFO Starting EventSource {"controller": "tfjob-controller", "source": "kind source: *v1.Service"}
1.689313340254115e+09 INFO Starting Controller {"controller": "tfjob-controller"}
1.6893133402534952e+09 INFO Starting EventSource {"controller": "mpijob-controller", "source": "kind source: *v1.MPIJob"}
1.6893133402541409e+09 INFO Starting EventSource {"controller": "mpijob-controller", "source": "kind source: *v1.Pod"}
1.6893133402541444e+09 INFO Starting EventSource {"controller": "mpijob-controller", "source": "kind source: *v1.ConfigMap"}
1.6893133402542117e+09 INFO Starting EventSource {"controller": "mxjob-controller", "source": "kind source: *v1.Service"}
1.6893133402542229e+09 INFO Starting Controller {"controller": "mxjob-controller"}
1.6893133402542326e+09 INFO Starting EventSource {"controller": "paddlejob-controller", "source": "kind source: *v1.Service"}
1.6893133402542348e+09 INFO Starting Controller {"controller": "paddlejob-controller"}
1.6893133402542171e+09 INFO Starting EventSource {"controller": "mpijob-controller", "source": "kind source: *v1.Role"}
1.6893133402542467e+09 INFO Starting EventSource {"controller": "mpijob-controller", "source": "kind source: *v1.RoleBinding"}
1.6893133402542505e+09 INFO Starting EventSource {"controller": "mpijob-controller", "source": "kind source: *v1.ServiceAccount"}
1.6893133402542531e+09 INFO Starting Controller {"controller": "mpijob-controller"}
1.689313340254083e+09 INFO Starting EventSource {"controller": "xgboostjob-controller", "source": "kind source: *v1.Service"}
1.6893133402546306e+09 INFO Starting Controller {"controller": "xgboostjob-controller"}
1.6893133403579748e+09 INFO Starting workers {"controller": "paddlejob-controller", "worker count": 1}
1.6893133403599951e+09 INFO Starting workers {"controller": "xgboostjob-controller", "worker count": 1}
1.6893133403601058e+09 INFO Starting workers {"controller": "pytorchjob-controller", "worker count": 1}
1.6893133403601074e+09 INFO Starting workers {"controller": "mxjob-controller", "worker count": 1}
1.6893133403601336e+09 INFO Starting workers {"controller": "tfjob-controller", "worker count": 1}
1.6893133403601432e+09 INFO Starting workers {"controller": "mpijob-controller", "worker count": 1}
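As an optional smoke test, a minimal PyTorchJob can be submitted to confirm the controllers actually reconcile jobs. This is only a sketch: the image and command are placeholders to replace with a real training image, and the namespace is the kubeflow one used above.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: smoke-test
  namespace: kubeflow
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch                    # PyTorchJob expects this container name
              image: python:3.10               # placeholder; replace with a real training image
              command: ["python", "-c", "print('smoke test')"]
The job status can then be checked with kubectl get pytorchjob smoke-test -n kubeflow.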