kubernetes, dockerfile, py-langchain, ollama, docker-entrypoint

Running Ollama as a k8s STS with external script as entrypoint to load models


I managed to run Ollama as a k8s StatefulSet (STS). I am using it for a Python LangChain LLM/RAG application. However, the following Dockerfile ENTRYPOINT script, which tries to pull a list of models exported as the MODELS env var from the k8s STS manifest, runs into a problem. The Dockerfile has the following ENTRYPOINT and CMD:

ENTRYPOINT ["/usr/local/bin/run.sh"]
CMD ["bash"]

run.sh:

#!/bin/bash
set -x
ollama serve&
sleep 10
models="${MODELS//,/ }"
for i in "${models[@]}"; do \
      echo model: $i  \
      ollama pull $i \
    done

k8s logs:

+ models=llama3.2
/usr/local/bin/run.sh: line 10: syntax error: unexpected end of file

Update: I tried David Maze's solution:

          lifecycle:
            postStart:
              exec:
                command:
                  - bash
                  - -c
                  - |
                    for i in $(seq 10); do
                      ollama ps && break
                      sleep 1
                    done
                    for model in ${MODELS//,/ }; do
                      ollama pull "$model"
                    done

Pod status and events:

ollama-0          1/2     CrashLoopBackOff     4 (3s ago)        115s
ollama-1          1/2     CrashLoopBackOff     4 (1s ago)        115s
  Warning  FailedPostStartHook  106s (x3 over 2m14s)  kubelet            PostStartHook failed

$ k logs -fp ollama-0
Defaulted container "ollama" out of: ollama, fluentd
Error: unknown command "ollama" for "ollama"

Updated Dockerfile:

ENTRYPOINT ["/bin/ollama"]
#CMD ["bash"]
CMD ["ollama", "serve"]

I need the customized Dockerfile so that I can install the NVIDIA Container Toolkit.


Solution

  • At a mechanical level, the backslashes inside the for loop are causing problems. They make the shell join the lines together, so you get a single command echo model: $i ollama pull $i done, but there's no standalone done command to terminate the loop.

    The next problem you'll run into is that this entrypoint script is the only thing the container runs, and when the script exits, the container will exit as well. It doesn't matter that you've started the Ollama server in the background. If you want to run the container this way, you need to wait for the server process to exit. That would look something like:

    #!/bin/bash
    ollama serve &
    pid=$!                       # ADD: save the process ID of the server
    sleep 10
    models="${MODELS//,/ }"
    for i in "${models[@]}"; do  # FIX: remove backslashes
      echo model: "$i"
      ollama pull "$i"
    done
    wait "$pid"                  # ADD: keep the script running as long as the server is too
    

    However, this model of starting a background process and then waiting for it often isn't the best approach. If the Pod gets shut down, for example, the termination signal will go to the wrapper script and not the Ollama server, and you won't be able to have a clean shutdown.
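
    If you do keep the wrapper approach, a common pattern is to trap the termination signal and forward it to the server yourself. A minimal sketch (assuming the same MODELS variable; not part of the original answer):

    #!/bin/bash
    ollama serve &
    pid=$!
    # forward SIGTERM/SIGINT to the Ollama server so the Pod can stop cleanly
    trap 'kill -TERM "$pid"; wait "$pid"' TERM INT
    sleep 10
    for i in ${MODELS//,/ }; do
      ollama pull "$i"
    done
    wait "$pid"   # returns when the server exits or is signalled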

    In a Kubernetes context (you say you're running this in a StatefulSet) a PostStart hook fits here. This will let you run an unmodified image, but add your own script that runs at about the same time as the container startup. In a Kubernetes manifest this might look like:

    spec:
      template:
        spec:
          containers:
            - name: ollama
              image: ollama/ollama  # the unmodified upstream image
              lifecycle:
                postStart:
                  exec:
                    command:
                      - /bin/sh
                      - -c
                      - |
                          for i in $(seq 10); do
                            ollama ps && break
                            sleep 1
                          done
                          for model in llama3.2; do
                            ollama pull "$model"
                          done
    

    This setup writes a shell script inline in the Kubernetes manifest, wrapping it in /bin/sh -c so it can be run this way. The hook uses the "exec" mechanism, so the script runs as a secondary process in the same container. The first loop waits up to 10 seconds for the server to be up, and the second pulls the models.
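
    Once the Pod is Running, you can confirm the models were pulled with something along these lines (the container name ollama matches the logs above):

    kubectl exec ollama-0 -c ollama -- ollama list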