docker · dockerfile · streamsets

How to Import Streamsets pipeline in Dockerfile without container exiting


I am trying to import a pipeline into StreamSets during container startup by using the Docker CMD instruction in the Dockerfile. The image builds, and the container is created without errors, but it exits with code 0, so it never comes up. Here is what I did:

Dockerfile:

FROM streamsets/datacollector:3.18.1

COPY myPipeline.json /pipelinejsonlocation/

EXPOSE 18630

ENTRYPOINT ["/bin/sh"]
CMD ["/opt/streamsets-datacollector-3.18.1/bin/streamsets","cli","-U", "http://localhost:18630", \
    "-u", \
    "admin", \ 
    "-p", \ 
    "admin",  \
    "store",  \
    "import",  \
    "-n", \
    "myPipeline", \
    "--stack", \ 
    "-f",  \
    "/pipelinejsonlocation/myPipeline.json"]

Build image:

docker build -t cmp/sdc .

Run image:

docker run -p 18630:18630 -d --name sdc cmp/sdc

This outputs the container ID, but the container ends up in Exited status, as shown below.

    docker ps -a
    CONTAINER ID  IMAGE        COMMAND                  CREATED             STATUS                     PORTS   NAMES
    537adb1b05ab  cmp/sdc     "/bin/sh /opt/stream…"   5 seconds ago       Exited (0) 3 seconds ago           sdc 
    

When I do not specify the CMD instruction in the Dockerfile, the StreamSets container spins up, and when I then run the streamsets import command in a shell inside the running container, it works. But how do I get this done during provisioning itself? Is there something I am missing in the Dockerfile?
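
For reference, this is roughly what I run manually inside the running container (via docker exec), and the import succeeds:

docker exec sdc /opt/streamsets-datacollector-3.18.1/bin/streamsets cli -U http://localhost:18630 -u admin -p admin store import -n myPipeline --stack -f /pipelinejsonlocation/myPipeline.json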


Solution

  • In your Dockerfile you overwrite the default CMD and ENTRYPOINT of the StreamSets Data Collector image. The container therefore only executes your import command during startup and exits without errors once it has finished. That is why your container ends up in Exited (0) status.

    In general this is expected behavior: a container only stays alive as long as its foreground process is running. If you want to keep the container alive, you need to execute another command in the foreground that never ends, and unfortunately you cannot run multiple CMDs in your Dockerfile.
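
    You can check what happened during that short startup, and how the container exited, with for example:

    docker logs sdc
    docker inspect --format '{{.State.Status}} {{.State.ExitCode}}' sdc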

    I dug a little deeper. The default entry point of the image is ENTRYPOINT ["/docker-entrypoint.sh"]. This script sets up a few things and starts the Data Collector.
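
    You can check these defaults yourself, for example:

    docker inspect streamsets/datacollector:3.18.1 --format 'ENTRYPOINT: {{.Config.Entrypoint}}  CMD: {{.Config.Cmd}}'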

    The Data Collector must be running before the pipeline can be imported. So a solution could be to copy the default docker-entrypoint.sh and modify it to start the Data Collector and import the pipeline afterwards. You could do it like this:

    Dockerfile:

    FROM streamsets/datacollector:3.18.1
    
    COPY myPipeline.json /pipelinejsonlocation/
    # Replace docker-entrypoint.sh
    COPY docker-entrypoint.sh /docker-entrypoint.sh 
    
    EXPOSE 18630
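
    Note that COPY keeps the file permissions from your build context, so if your local copy of docker-entrypoint.sh is not executable, the container may fail to start with a permission error. Marking it executable before building avoids that:

    chmod +x docker-entrypoint.sh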
    

    docker-entrypoint.sh (https://github.com/streamsets/datacollector-docker/blob/master/docker-entrypoint.sh):

    #!/bin/bash
    #
    # Copyright 2017 StreamSets Inc.
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    #
    
    set -e
    
    # We translate environment variables to sdc.properties and rewrite them.
    set_conf() {
      if [ $# -ne 2 ]; then
        echo "set_conf requires two arguments: <key> <value>"
        exit 1
      fi
    
      if [ -z "$SDC_CONF" ]; then
        echo "SDC_CONF is not set."
        exit 1
      fi
    
      grep -q "^$1" ${SDC_CONF}/sdc.properties && sed 's|^#\?\('"$1"'=\).*|\1'"$2"'|' -i ${SDC_CONF}/sdc.properties || echo -e "\n$1=$2" >> ${SDC_CONF}/sdc.properties
    }
    
    # support arbitrary user IDs
    # ref: https://docs.openshift.com/container-platform/3.3/creating_images/guidelines.html#openshift-container-platform-specific-guidelines
    if ! whoami &> /dev/null; then
      if [ -w /etc/passwd ]; then
        echo "${SDC_USER:-sdc}:x:$(id -u):0:${SDC_USER:-sdc} user:${HOME}:/sbin/nologin" >> /etc/passwd
      fi
    fi
    
    # In some environments such as Marathon $HOST and $PORT0 can be used to
    # determine the correct external URL to reach SDC.
    if [ ! -z "$HOST" ] && [ ! -z "$PORT0" ] && [ -z "$SDC_CONF_SDC_BASE_HTTP_URL" ]; then
      export SDC_CONF_SDC_BASE_HTTP_URL="http://${HOST}:${PORT0}"
    fi
    
    for e in $(env); do
      key=${e%=*}
      value=${e#*=}
      if [[ $key == SDC_CONF_* ]]; then
        lowercase=$(echo $key | tr '[:upper:]' '[:lower:]')
        key=$(echo ${lowercase#*sdc_conf_} | sed 's|_|.|g')
        set_conf $key $value
      fi
    done
    
    # MODIFICATIONS:
    #exec "${SDC_DIST}/bin/streamsets" "$@"
    
    check_data_collector_status () {
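       # watch re-runs the ping every second; as soon as the response contains
       # 'version', grep -q exits, watch is terminated by the broken pipe, and
       # import_pipeline is called.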
       watch -n 1 ${SDC_DIST}/bin/streamsets cli -U http://localhost:18630 ping | grep -q 'version' && echo "Data Collector has started!" && import_pipeline
    }
    
    function import_pipeline () {
        sleep 1
    
        echo "Start to import pipeline"
        ${SDC_DIST}/bin/streamsets cli -U http://localhost:18630 -u admin -p admin store import -n myPipeline --stack -f /pipelinejsonlocation/myPipeline.json
    
        echo "Finished importing pipeline"
    }
    
    # Start checking if Data Collector is up (in background) and start Data Collector
    check_data_collector_status & ${SDC_DIST}/bin/streamsets $@
    

    I commented out the last line exec "${SDC_DIST}/bin/streamsets" "$@" of the default docker-entrypoint.sh and added two functions. check_data_collector_status () pings the Data Collector service until it is available. import_pipeline () imports your pipeline.

    check_data_collector_status () runs in the background while ${SDC_DIST}/bin/streamsets $@ is started in the foreground as before, so the pipeline is imported once the Data Collector service is up.
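
    If watch is not available in your base image (or behaves differently when its output is piped), a plain polling loop against the same ping endpoint would be an alternative sketch for check_data_collector_status ():

    check_data_collector_status () {
       # Poll the ping endpoint until the Data Collector answers, then import.
       until ${SDC_DIST}/bin/streamsets cli -U http://localhost:18630 ping 2>/dev/null | grep -q 'version'; do
         sleep 1
       done
       echo "Data Collector has started!"
       import_pipeline
    }

    Once the container is up, docker logs sdc should show the "Start to import pipeline" and "Finished importing pipeline" messages, and the imported pipeline should appear when you list the pipeline store, for example:

    docker exec sdc /opt/streamsets-datacollector-3.18.1/bin/streamsets cli -U http://localhost:18630 -u admin -p admin store list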