docker · dockerfile · streamsets

How to Import Streamsets pipeline in Dockerfile without container exiting


I am trying to import a pipeline into StreamSets during container startup by using the Docker CMD instruction in the Dockerfile. The image builds, and the container is created without errors, but it exits with code 0, so it never comes up. Here is what I did:

Dockerfile:

FROM streamsets/datacollector:3.18.1

COPY myPipeline.json /pipelinejsonlocation/

EXPOSE 18630

ENTRYPOINT ["/bin/sh"]
CMD ["/opt/streamsets-datacollector-3.18.1/bin/streamsets","cli","-U", "http://localhost:18630", \
    "-u", \
    "admin", \ 
    "-p", \ 
    "admin",  \
    "store",  \
    "import",  \
    "-n", \
    "myPipeline", \
    "--stack", \ 
    "-f",  \
    "/pipelinejsonlocation/myPipeline.json"]

Build image:

docker build -t cmp/sdc .

Run image:

docker run -p 18630:18630 -d --name sdc cmp/sdc

This outputs the container ID, but the container ends up in Exited status, as shown below.

    docker ps -a
    CONTAINER ID  IMAGE        COMMAND                  CREATED             STATUS                     PORTS   NAMES
    537adb1b05ab  cmp/sdc     "/bin/sh /opt/stream…"   5 seconds ago       Exited (0) 3 seconds ago           sdc 
    

When I do not specify the CMD instruction in the Dockerfile, the StreamSets container spins up, and when I then run the streamsets import command in a shell inside the running container, it works. But how do I get this done during provisioning itself? Is there something I am missing in the Dockerfile?
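
For reference, this is roughly what I run manually inside the running container (via docker exec), and the import succeeds:

docker exec sdc /opt/streamsets-datacollector-3.18.1/bin/streamsets cli -U http://localhost:18630 -u admin -p admin store import -n myPipeline --stack -f /pipelinejsonlocation/myPipeline.json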


Solution

  • In your Dockerfile you overwrite the default CMD and ENTRYPOINT of the StreamSets Data Collector image. The container therefore only executes your import command during startup and exits without errors once it has finished. That is why your container ends up in Exited (0) status.

    In general this is expected behavior: a container only stays alive as long as its foreground process is running. If you want to keep the container alive, you need to execute another command in the foreground that never ends, and unfortunately you cannot run multiple CMDs in your Dockerfile.
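
    You can check what happened during that short startup, and how the container exited, with for example:

    docker logs sdc
    docker inspect --format '{{.State.Status}} {{.State.ExitCode}}' sdc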

    I dug a little deeper. The default entry point of the image is ENTRYPOINT ["/docker-entrypoint.sh"]. This script sets up a few things and starts the Data Collector.
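
    You can check these defaults yourself, for example:

    docker inspect streamsets/datacollector:3.18.1 --format 'ENTRYPOINT: {{.Config.Entrypoint}}  CMD: {{.Config.Cmd}}'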

    The Data Collector must be running before the pipeline can be imported. So a solution could be to copy the default docker-entrypoint.sh and modify it to start the Data Collector and import the pipeline afterwards. You could do it like this:

    Dockerfile:

    FROM streamsets/datacollector:3.18.1
    
    COPY myPipeline.json /pipelinejsonlocation/
    # Replace docker-entrypoint.sh
    COPY docker-entrypoint.sh /docker-entrypoint.sh 
    
    EXPOSE 18630
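
    Note that COPY keeps the file permissions from your build context, so if your local copy of docker-entrypoint.sh is not executable, the container may fail to start with a permission error. Marking it executable before building avoids that:

    chmod +x docker-entrypoint.sh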
    

    docker-entrypoint.sh (https://github.com/streamsets/datacollector-docker/blob/master/docker-entrypoint.sh):

    #!/bin/bash
    #
    # Copyright 2017 StreamSets Inc.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    #
    
    set -e
    
    # We translate environment variables to sdc.properties and rewrite them.
    set_conf() {
      if [ $# -ne 2 ]; then
        echo "set_conf requires two arguments: <key> <value>"
        exit 1
      fi
    
      if [ -z "$SDC_CONF" ]; then
        echo "SDC_CONF is not set."
        exit 1
      fi
    
      grep -q "^$1" ${SDC_CONF}/sdc.properties && sed 's|^#\?\('"$1"'=\).*|\1'"$2"'|' -i ${SDC_CONF}/sdc.properties || echo -e "\n$1=$2" >> ${SDC_CONF}/sdc.properties
    }
    
    # support arbitrary user IDs
    # ref: https://docs.openshift.com/container-platform/3.3/creating_images/guidelines.html#openshift-container-platform-specific-guidelines
    if ! whoami &> /dev/null; then
      if [ -w /etc/passwd ]; then
        echo "${SDC_USER:-sdc}:x:$(id -u):0:${SDC_USER:-sdc} user:${HOME}:/sbin/nologin" >> /etc/passwd
      fi
    fi
    
    # In some environments such as Marathon $HOST and $PORT0 can be used to
    # determine the correct external URL to reach SDC.
    if [ ! -z "$HOST" ] && [ ! -z "$PORT0" ] && [ -z "$SDC_CONF_SDC_BASE_HTTP_URL" ]; then
      export SDC_CONF_SDC_BASE_HTTP_URL="http://${HOST}:${PORT0}"
    fi
    
    for e in $(env); do
      key=${e%=*}
      value=${e#*=}
      if [[ $key == SDC_CONF_* ]]; then
        lowercase=$(echo $key | tr '[:upper:]' '[:lower:]')
        key=$(echo ${lowercase#*sdc_conf_} | sed 's|_|.|g')
        set_conf $key $value
      fi
    done
    
    # MODIFICATIONS:
    #exec "${SDC_DIST}/bin/streamsets" "$@"
    
    check_data_collector_status () {
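       # watch re-runs the ping every second; as soon as the response contains
       # 'version', grep -q exits, watch is terminated by the broken pipe, and
       # import_pipeline is called.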
       watch -n 1 ${SDC_DIST}/bin/streamsets cli -U http://localhost:18630 ping | grep -q 'version' && echo "Data Collector has started!" && import_pipeline
    }
    
    function import_pipeline () {
        sleep 1
    
        echo "Start to import pipeline"
        ${SDC_DIST}/bin/streamsets cli -U http://localhost:18630 -u admin -p admin store import -n myPipeline --stack -f /pipelinejsonlocation/myPipeline.json
    
        echo "Finished importing pipeline"
    }
    
    # Start checking if Data Collector is up (in background) and start Data Collector
    check_data_collector_status & ${SDC_DIST}/bin/streamsets $@
    

    I commented out the last line exec "${SDC_DIST}/bin/streamsets" "$@" of the default docker-entrypoint.sh and added two functions. check_data_collector_status () pings the Data Collector service until it is available. import_pipeline () imports your pipeline.

    check_data_collector_status () runs in the background while ${SDC_DIST}/bin/streamsets $@ is started in the foreground as before, so the pipeline is imported once the Data Collector service is up.
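
    If watch is not available in your base image (or behaves differently when its output is piped), a plain polling loop against the same ping endpoint would be an alternative sketch for check_data_collector_status ():

    check_data_collector_status () {
       # Poll the ping endpoint until the Data Collector answers, then import.
       until ${SDC_DIST}/bin/streamsets cli -U http://localhost:18630 ping 2>/dev/null | grep -q 'version'; do
         sleep 1
       done
       echo "Data Collector has started!"
       import_pipeline
    }

    Once the container is up, docker logs sdc should show the "Start to import pipeline" and "Finished importing pipeline" messages, and the imported pipeline should appear when you list the pipeline store, for example:

    docker exec sdc /opt/streamsets-datacollector-3.18.1/bin/streamsets cli -U http://localhost:18630 -u admin -p admin store list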