POD Definition - Deploying to DC/OS

I'm new to DC/OS and I have been really struggling trying to deploy a POD. I have tried the simple examples provided in the documentation but the deployments remain stuck in the deploying stage. There are plenty of resources available so that is not the issue.

I have 3 containers that I need to exist within a virtual network (queue, PDI, API). I have included my definition file that starts with a single container deployment and once I can successfully deploy I will add 2 additional containers to the definition. I have been looking at this example but have been unsuccessful.

I have successfully deployed the containers one at a time through Jenkins. All 3 images have been published and exist in the Docker registry (JFrog). I have included an example of my marathon.json for one of those successful deployments. I would appreciate any feedback that can help. The service is stuck in the deploying stage, so I'm unable to drill down and see the logs via the command line or UI.

containers.image = pdi-queue

artifactory server = repos.pdi.com:5010/pdi-queue

1 Container POD Definition - (Error: Stuck in Deployment Stage)

{
"id":"/pdi-queue",
"containers":[
   {
      "name":"simple-docker",
      "resources":{
         "cpus":1,
         "mem":128,
         "disk":0,
         "gpus":0
      },
      "image":{
         "kind":"DOCKER",
         "id":"repos.pdi.com:5010/pdi-queue",
         "portMappings":[
            {
               "hostPort": 0,
               "containerPort": 15672,
               "protocol": "tcp",
               "servicePort": 15672

            }
         ]
      },
      "endpoints":[
         {
            "name":"web",
            "containerPort":80,
            "protocol":[
               "http"
            ]
         }

      ],
      "healthCheck":{
         "http":{
            "endpoint":"web",
            "path":"/"
         }
      }
   }
],
"networks":[
   {
      "mode":"container",
      "name":"dcos"
   }
]

}

Marathon.json - (No Error: Successful deployment)

  {
  "id": "/pdi-queue",
  "backoffFactor": 1.15,
  "backoffSeconds": 1,
  "container": {
    "portMappings": [
      {"containerPort": 15672, "hostPort": 0, "protocol": "tcp", "servicePort": 15672, "name": "health"},
      {"containerPort": 5672, "hostPort": 0, "protocol": "tcp", "servicePort": 5672, "name": "queue"}
    ],
    "type": "DOCKER",
    "volumes": [],
    "docker": {
      "image": "repos.pdi.com:5010/pdi-queue",
      "forcePullImage": true,
      "privileged": false,
      "parameters": []
    }
  },
  "cpus": 0.1,
  "disk": 0,
  "healthChecks": [
    {
      "gracePeriodSeconds": 300,
      "intervalSeconds": 60,
      "maxConsecutiveFailures": 3,
      "portIndex": 0,
      "timeoutSeconds": 20,
      "delaySeconds": 15,
      "protocol": "MESOS_HTTP",
      "path": "/"
    }
  ],
  "instances": 1,
  "maxLaunchDelaySeconds": 3600,
  "mem": 512,
  "gpus": 0,
  "networks": [
    {
      "mode": "container/bridge"
    }
  ],
  "requirePorts": false,
  "upgradeStrategy": {
    "maximumOverCapacity": 1,
    "minimumHealthCapacity": 1
  },
  "killSelection": "YOUNGEST_FIRST",
  "unreachableStrategy": {
    "inactiveAfterSeconds": 300,
    "expungeAfterSeconds": 600
  },
  "fetch": [],
  "constraints": [],
  "labels": {
    "traefik.frontend.redirect.entryPoint": "https",
    "traefik.frontend.redirect.permanent": "true",
    "traefik.enable": "true"
  }

}

Solution

I may not know the exact cause of the issue you are running into, but I can share some pointers to help debug it.

First of all, if you are unable to view logs from the DC/OS UI, you can go to <cluster_url>/mesos and look for the simple-docker task (the container name in your pod definition) under Completed Tasks. It should show up as TASK_FAILED. Click the Sandbox link on the right and check the stderr and stdout files for the task. There may be clues there as to why it failed.

Another place to look is the agent log. In the Mesos UI, note the IP of the agent where the task failed, SSH into that node, and run sudo journalctl -u dcos-mesos-slave to see the agent logs. Then look for the entries that correspond to the failing task.
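The agent journal can be long, so it helps to narrow it to lines that mention the task. Here is a minimal sketch, assuming the log lines for the failing task contain the pod ID pdi-queue; filter_task_logs is a hypothetical helper name, not a DC/OS command:

```shell
#!/bin/sh
# Hypothetical helper: read log lines on stdin and keep only those that
# mention the given pattern (case-insensitive).
filter_task_logs() {
    grep -i -- "$1"
}

# Usage on the agent node (not executed here):
#   sudo journalctl -u dcos-mesos-slave --no-pager --since "1 hour ago" \
#       | filter_task_logs pdi-queue
```

The --since window is only an example; widen it if the failed deployment is older than that.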

One difference between running the application as a Pod and the App definition you shared is that your App definition uses the DOCKER containerizer for the task, while Pods use the MESOS containerizer. I noticed that you are using a private Docker registry for your images. One possibility is that your private registry's certificate is not trusted by Mesos, even though Docker is already configured to trust it. If that is the case, you can add the certificate to the store that Mesos uses:

<copy the certificate(s) to /var/lib/dcos/pki/tls/certs>
cd /var/lib/dcos/pki/tls/certs
for file in *.crt; do ln -s "$file" "$(openssl x509 -hash -noout -in "$file").0"; done

This would need to be done on each agent node.
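Since the same step has to be repeated on every agent, a rerunnable variant of the loop may be handy. This is only a sketch under the same assumptions as the commands above (certificates already copied into the directory, openssl available on the node); link_cert_hashes is a hypothetical function name:

```shell
#!/bin/sh
# Hypothetical helper: create <subject-hash>.0 symlinks for every .crt file
# in the given directory. Safe to rerun because existing links are replaced.
link_cert_hashes() {
    cert_dir="$1"
    for cert in "$cert_dir"/*.crt; do
        # If the directory has no .crt files the glob stays literal; skip it.
        [ -e "$cert" ] || continue
        # Skip files that openssl cannot parse as a certificate.
        cert_hash="$(openssl x509 -hash -noout -in "$cert")" || continue
        # Relative target, so the link resolves inside cert_dir.
        ln -sf "$(basename "$cert")" "$cert_dir/$cert_hash.0"
    done
}

# Usage (as root, on each agent; not executed here):
#   link_cert_hashes /var/lib/dcos/pki/tls/certs
```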

If it's not a certificate issue, it could be a Docker registry credential issue. If the Docker registry you are using requires authentication, you can specify the Docker credentials at install time (assuming the advanced install method); see https://docs.mesosphere.com/1.11/installing/production/advanced-configuration/configuration-reference/#cluster-docker-credentials
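For reference, Docker-style credential files store an auth value that is the base64 encoding of user:password under an auths entry keyed by the registry. The sketch below builds such an entry; I am assuming this is the shape the linked setting expects, so verify it against that reference. docker_auth_json is a hypothetical helper and the credentials in the usage line are placeholders:

```shell
#!/bin/sh
# Hypothetical helper: print a Docker-style credentials JSON document for
# one registry. Arguments: registry host[:port], username, password.
docker_auth_json() {
    registry="$1"
    user="$2"
    password="$3"
    # Docker's "auth" field is base64("user:password"); strip any line wraps.
    auth="$(printf '%s' "$user:$password" | base64 | tr -d '\n')"
    printf '{"auths": {"%s": {"auth": "%s"}}}\n' "$registry" "$auth"
}

# Usage with placeholder credentials (not executed here):
#   docker_auth_json repos.pdi.com:5010 myuser mypassword
```

Note that base64 is an encoding, not encryption, so the output should be treated as a secret.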