grafana prometheus ibm-cloud-private

IBM Cloud Private 2.1.0.1 EE fails with a timeout error while installing the monitoring service


I have been trying to set up ICP EE on a single node, but I keep getting an installation failure once I reach the "Deploying monitoring service" task.

This particular task runs for about 30 minutes and then fails. The error log is below.

Is there something I need to do differently?

I followed the basic install steps from the Knowledge Center.

TASK [monitoring : Deploying monitoring service] *******************************
    fatal: [localhost]: FAILED! => {
   "changed":true,
   "cmd":"kubectl apply --force --overwrite=true -f /installer/playbook/..//cluster/cfc-components/monitoring/",
   "delta":"0:30:37.425771",
   "end":"2018-02-26 17:19:04.780643",
   "failed":true,
   "rc":1,
   "start":"2018-02-26 16:48:27.354872",
   "stderr":"Error from server: error when creating \"/installer/cluster/cfc-components/monitoring/grafana-router-config.yaml\": timeout\nError from server (Timeout): error when creating \"/installer/cluster/cfc-components/monitoring/kube-state-metrics-deployment.yaml\": the server was unable to return a response in the time allotted, but may still be processing the request (post deployments.extensions)",
   "stderr_lines":[
      "Error from server: error when creating \"/installer/cluster/cfc-components/monitoring/grafana-router-config.yaml\": timeout",
      "Error from server (Timeout): error when creating \"/installer/cluster/cfc-components/monitoring/kube-state-metrics-deployment.yaml\": the server was unable to return a response in the time allotted, but may still be processing the request (post deployments.extensions)"
   ],
   "stdout":"configmap \"alert-rules\" created\nconfigmap \"monitoring-prometheus-alertmanager\" created\ndeployment \"monitoring-prometheus-alertmanager\" created\nconfigmap \"alertmanager-router-nginx-config\" created\nservice \"monitoring-prometheus-alertmanager\" created\ndeployment \"monitoring-exporter\" created\nservice \"monitoring-exporter\" created\nconfigmap \"monitoring-grafana-config\" created\ndeployment \"monitoring-grafana\" created\nconfigmap \"grafana-entry-config\" created\nservice \"monitoring-grafana\" created\njob \"monitoring-grafana-ds\" created\nconfigmap \"grafana-ds-entry-config\" created\nservice \"monitoring-prometheus-kubestatemetrics\" created\ndaemonset \"monitoring-prometheus-nodeexporter-amd64\" created\ndaemonset \"monitoring-prometheus-nodeexporter-ppc64le\" created\ndaemonset \"monitoring-prometheus-nodeexporter-s390x\" created\nservice \"monitoring-prometheus-nodeexporter\" created\nconfigmap \"monitoring-prometheus\" created\ndeployment \"monitoring-prometheus\" created\nconfigmap \"prometheus-router-nginx-config\" created\nservice \"monitoring-prometheus\" created\nconfigmap \"monitoring-router-entry-config\" created",
   "stdout_lines":[
      "configmap \"alert-rules\" created",
      "configmap \"monitoring-prometheus-alertmanager\" created",
      "deployment \"monitoring-prometheus-alertmanager\" created",
      "configmap \"alertmanager-router-nginx-config\" created",
      "service \"monitoring-prometheus-alertmanager\" created",
      "deployment \"monitoring-exporter\" created",
      "service \"monitoring-exporter\" created",
      "configmap \"monitoring-grafana-config\" created",
      "deployment \"monitoring-grafana\" created",
      "configmap \"grafana-entry-config\" created",
      "service \"monitoring-grafana\" created",
      "job \"monitoring-grafana-ds\" created",
      "configmap \"grafana-ds-entry-config\" created",
      "service \"monitoring-prometheus-kubestatemetrics\" created",
      "daemonset \"monitoring-prometheus-nodeexporter-amd64\" created",
      "daemonset \"monitoring-prometheus-nodeexporter-ppc64le\" created",
      "daemonset \"monitoring-prometheus-nodeexporter-s390x\" created",
      "service \"monitoring-prometheus-nodeexporter\" created",
      "configmap \"monitoring-prometheus\" created",
      "deployment \"monitoring-prometheus\" created",
      "configmap \"prometheus-router-nginx-config\" created",
      "service \"monitoring-prometheus\" created",
      "configmap \"monitoring-router-entry-config\" created"
   ]
}

Solution

  • Does this node have at least 16G of memory (or even 32G)? It may be that the host is overwhelmed by the initial load as pods are coming online.
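
    If you're not sure, a quick way to check the memory and load on the host before digging into Kubernetes itself (a minimal sketch, assuming a standard Linux node):

    # Show total, used, and free memory plus swap, in MB
    free -m

    # One batch iteration of top to see load and the biggest memory consumers
    top -b -n 1 | head -n 20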

    The second thing to test is what happens to the pods when the monitoring directory is applied. Check their state with the command below (a quick filter for problem pods is sketched after the questions):

    kubectl -n kube-system get pod -o wide

    1. Are pods stuck in non-Running states?
    2. Are containers within pods not starting (e.g. showing 0/2 or 1/3 or similar)?
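
    One quick way to surface the problem pods (a sketch; note that grep -v Running also hides the header line and any Completed pods):

    # List only pods that are not fully Running
    kubectl -n kube-system get pod -o wide | grep -v Running

    # Or watch the pods as they come up over time
    kubectl -n kube-system get pod -w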

    Next, collect the kubelet logs and scan them for errors:

    journalctl -ru kubelet -o cat | head -n 500 > kubelet-logs.txt

    1. Does the kubelet complain about being able to start containers?
    2. Does the kubelet complain about Docker being unhealthy?

    3. If any pod looks unhealthy (per the pod checks above), describe it and check whether any of its events indicate why it is failing:

    kubectl -n kube-system describe pod [failing-pod-name]
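
    To focus on just the recent events rather than the full describe output, something like this can help (a sketch; events age out fairly quickly, so run it soon after the failure):

    # Show only the Events section of the describe output
    kubectl -n kube-system describe pod [failing-pod-name] | grep -A 20 -i events

    # Or list recent events for the whole namespace, oldest first
    kubectl -n kube-system get events --sort-by=.metadata.creationTimestamp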

    If you haven't already configured kubectl on the host to interact with the system, or if the auth-idp pod has not yet deployed, you can use the following steps to configure kubectl:

    docker run -e LICENSE=accept -v /usr/local/bin:/data \
      ibmcom/icp-inception:[YOUR_VERSION] \
      cp /usr/local/bin/kubectl /data

    export KUBECONFIG=/var/lib/kubelet/kubelet-config
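
    Once KUBECONFIG is exported, a quick sanity check that kubectl can reach the API server (standard kubectl commands, sketched here for convenience):

    # Should list the node and its Ready status
    kubectl get nodes

    # Should show the monitoring pods created by the failed apply
    kubectl -n kube-system get pod | grep monitoring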