
Argo sample workflows stuck in the pending state


I am following the Argo Workflows Getting Started documentation. Everything goes smoothly until I run the first sample workflow, as described in 4. Run Sample Workflows. The workflow just gets stuck in the pending state:

vagrant@master:~$ argo submit --watch https://raw.githubusercontent.com/argoproj/argo/master/examples/hello-world.yaml
Name:                hello-world-z4lbs
Namespace:           default
ServiceAccount:      default
Status:              Pending
Created:             Thu May 14 12:36:45 +0000 (now)

vagrant@master:~$ argo list
NAME                STATUS    AGE   DURATION   PRIORITY
hello-world-z4lbs   Pending   27m   0s         0

Here it was mentioned that taints on the master node may be the problem, so I untainted the master node:

vagrant@master:~$ kubectl taint nodes --all node-role.kubernetes.io/master-
node/master untainted
taint "node-role.kubernetes.io/master" not found
taint "node-role.kubernetes.io/master" not found

Then I deleted the pending workflow and resubmitted it, but it got stuck in the pending state again.
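
For reference, the delete and resubmit looked roughly like this (using the workflow name from the earlier output):

vagrant@master:~$ argo delete hello-world-z4lbs
vagrant@master:~$ argo submit --watch https://raw.githubusercontent.com/argoproj/argo/master/examples/hello-world.yaml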

The details of the newly submitted workflow that is also stuck:

vagrant@master:~$ kubectl describe workflow hello-world-8kvmb
Name:         hello-world-8kvmb
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  argoproj.io/v1alpha1
Kind:         Workflow
Metadata:
  Creation Timestamp:  2020-05-14T13:57:44Z
  Generate Name:       hello-world-
  Generation:          1
  Managed Fields:
    API Version:  argoproj.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:generateName:
      f:spec:
        .:
        f:arguments:
        f:entrypoint:
        f:templates:
      f:status:
        .:
        f:finishedAt:
        f:startedAt:
    Manager:         argo
    Operation:       Update
    Time:            2020-05-14T13:57:44Z
  Resource Version:  16780
  Self Link:         /apis/argoproj.io/v1alpha1/namespaces/default/workflows/hello-world-8kvmb
  UID:               aa82d005-b7ac-411f-9d0b-93f34876b673
Spec:
  Arguments:
  Entrypoint:  whalesay
  Templates:
    Arguments:
    Container:
      Args:
        hello world
      Command:
        cowsay
      Image:  docker/whalesay:latest
      Name:   
      Resources:
    Inputs:
    Metadata:
    Name:  whalesay
    Outputs:
Status:
  Finished At:  <nil>
  Started At:   <nil>
Events:         <none>

While trying to get the workflow-controller logs I get the following error:

vagrant@master:~$ kubectl logs -n argo -l app=workflow-controller
Error from server (BadRequest): container "workflow-controller" in pod "workflow-controller-6c4787844c-lbksm" is waiting to start: ContainerCreating

The details for the corresponding workflow-controller pod:

vagrant@master:~$ kubectl -n argo describe pods/workflow-controller-6c4787844c-lbksm
Name:           workflow-controller-6c4787844c-lbksm
Namespace:      argo
Priority:       0
Node:           node-1/192.168.50.11
Start Time:     Thu, 14 May 2020 12:08:29 +0000
Labels:         app=workflow-controller
                pod-template-hash=6c4787844c
Annotations:    <none>
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  ReplicaSet/workflow-controller-6c4787844c
Containers:
  workflow-controller:
    Container ID:  
    Image:         argoproj/workflow-controller:v2.8.0
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      workflow-controller
    Args:
      --configmap
      workflow-controller-configmap
      --executor-image
      argoproj/argoexec:v2.8.0
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from argo-token-pz4fd (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  argo-token-pz4fd:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  argo-token-pz4fd
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age                      From             Message
  ----     ------                  ----                     ----             -------
  Normal   SandboxChanged          7m17s (x4739 over 112m)  kubelet, node-1  Pod sandbox changed, it will be killed and re-created.
  Warning  FailedCreatePodSandBox  2m18s (x4950 over 112m)  kubelet, node-1  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "1bd1fd11dfe677c749b4a1260c29c2f8cff0d55de113d154a822e68b41f9438e" network for pod "workflow-controller-6c4787844c-lbksm": networkPlugin cni failed to set up pod "workflow-controller-6c4787844c-lbksm_argo" network: stat /var/lib/calico/nodename: no such file or directory: check that the calico/node container is running and has mounted /var/lib/calico/

I am running Argo v2.8.0:

vagrant@master:~$ argo version
argo: v2.8.0
  BuildDate: 2020-05-11T22:55:16Z
  GitCommit: 8f696174746ed01b9bf1941ad03da62d312df641
  GitTreeState: clean
  GitTag: v2.8.0
  GoVersion: go1.13.4
  Compiler: gc
  Platform: linux/amd64

I have checked the cluster status and it looks OK:

vagrant@master:~$ kubectl get nodes
NAME     STATUS   ROLES    AGE   VERSION
master   Ready    master   95m   v1.18.2
node-1   Ready    <none>   92m   v1.18.2
node-2   Ready    <none>   92m   v1.18.2

As for the K8s cluster installation, I created it using Vagrant as described here, the only differences being:

Any idea why the workflows get stuck in the pending state and how to fix it?


Solution

  • Workflows start in the Pending state and then are moved through their steps by the workflow-controller pod (which is installed in the cluster as part of Argo).

    The workflow-controller pod is stuck in ContainerCreating. kubectl describe po {workflow-controller pod} reveals a Calico-related network error.

    As mentioned in the comments, this looks like a common Calico error. Once you clear that up (see the troubleshooting sketch after the pod listing below), your hello-world workflow should execute just fine.

    Note from OP: Further debugging confirms the Calico problem (the calico-node pods are not in the Running state):

    vagrant@master:~$ kubectl get pods --all-namespaces
    NAMESPACE     NAME                                       READY   STATUS              RESTARTS   AGE
    argo          argo-server-84946785b-94bfs                0/1     ContainerCreating   0          3h59m
    argo          workflow-controller-6c4787844c-lbksm       0/1     ContainerCreating   0          3h59m
    kube-system   calico-kube-controllers-74d45555dd-zhkp6   0/1     CrashLoopBackOff    56         3h59m
    kube-system   calico-node-2n9kt                          0/1     CrashLoopBackOff    72         3h59m
    kube-system   calico-node-b8sb8                          0/1     Running             70         3h56m
    kube-system   calico-node-pslzs                          0/1     CrashLoopBackOff    67         3h56m
    kube-system   coredns-66bff467f8-rmxsp                   0/1     ContainerCreating   0          3h59m
    kube-system   coredns-66bff467f8-z4lbq                   0/1     ContainerCreating   0          3h59m
    kube-system   etcd-master                                1/1     Running             2          3h59m
    kube-system   kube-apiserver-master                      1/1     Running             2          3h59m
    kube-system   kube-controller-manager-master             1/1     Running             2          3h59m
    kube-system   kube-proxy-k59ks                           1/1     Running             2          3h59m
    kube-system   kube-proxy-mn96x                           1/1     Running             1          3h56m
    kube-system   kube-proxy-vxj8b                           1/1     Running             1          3h56m
    kube-system   kube-scheduler-master                      1/1     Running             2          3h59m
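
    One way to confirm and clear the Calico issue (a sketch, assuming the standard Calico manifests, which label the node pods with k8s-app=calico-node):

    vagrant@master:~$ kubectl -n kube-system get pods -l k8s-app=calico-node
    vagrant@master:~$ kubectl -n kube-system logs -l k8s-app=calico-node -c calico-node --tail=50
    vagrant@master:~$ kubectl -n kube-system rollout restart daemonset/calico-node

    The first two commands show whether the calico-node pods are healthy and why they are crashing; the rollout restart recreates them once the underlying problem has been addressed. After Calico and CoreDNS come up, the workflow-controller pod should leave ContainerCreating and the pending workflows should start running.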