I can install and run the Ollama service with a GPU on an EC2 instance and make API calls to it from a web app as follows:
First, I create a Docker network so that the Ollama service and my web app can talk to each other:
docker network create my-net
Then I start the service from the official Ollama Docker image:
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama --net my-net ollama/ollama
Next, I pull and run the model (LLM) inside the Ollama container:
docker exec ollama ollama run <model_name> # like llama2, mistral, etc
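If you want to confirm the model was pulled, a quick check (assuming the container is named ollama as above) is:
docker exec ollama ollama list   # lists the models available inside the container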
Then I need to find the IP address of the Ollama container on this Docker network and export it as an API endpoint URL:
export OLLAMA_API_ENDPOINT=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' ollama)
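Alternatively, since both containers sit on the same user-defined network my-net, Docker's embedded DNS resolves the container name, so the endpoint could be set by name instead of by IP (a sketch; the exact format depends on how the web app builds the URL):
export OLLAMA_API_ENDPOINT=ollama   # or http://ollama:11434 if the app expects a full URL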
Finally, I pass this endpoint to my web app so it can make API calls to Ollama:
docker run -d -p 8080:8080 -e OLLAMA_API_ENDPOINT --rm --name my-web-app --net my-net app
With this, if you go to the following URL:
http://<PUBLIC_IP_OF_THE_EC2_INSTANCE>:8080
you can see the web app (chatbot) running and making API calls (chat) to the LLM.
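For reference, the kind of request the web app sends to Ollama looks roughly like this (a sketch against Ollama's /api/generate endpoint, run from the EC2 host since port 11434 is published; llama2 is just an example model name):
curl http://localhost:11434/api/generate -d '{"model": "llama2", "prompt": "Hello, who are you?"}'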
Now I want to deploy this app in our AWS Kubernetes cluster (EKS). For that, I wrote the following inference.yaml manifest to run Ollama and serve the LLM:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ollama-charlie-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/ollama
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-charlie-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-charlie
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-charlie
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: ollama-charlie
    spec:
      nodeSelector:
        ollama-charlie-key: ollama-charlie-value
      initContainers:
        - name: download-llm
          image: ollama/ollama
          command: ["ollama", "run", "kristada673/solar-10.7b-instruct-v1.0-uncensored"]
          volumeMounts:
            - name: data
              mountPath: /root/.ollama
      containers:
        - name: ollama-charlie
          image: ollama/ollama
          volumeMounts:
            - name: data
              mountPath: /root/.ollama
          livenessProbe:
            tcpSocket:
              port: 80
            initialDelaySeconds: 120 # Adjust based on your app's startup time
            periodSeconds: 30
            failureThreshold: 2 # Pod is restarted after 2 consecutive failures
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: ollama-charlie-pvc
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-charlie-service
spec:
  selector:
    app: ollama-charlie
  ports:
    - protocol: TCP
      port: 11434
      targetPort: 11434
Here, ollama-charlie-key: ollama-charlie-value comes from the GPU node group (g4dn.xlarge) I created; this is the label key/value pair I assigned to that node group.
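To double-check that this label is actually present on the GPU node (the nodeSelector only matches if it is), the nodes can be listed with their labels:
kubectl get nodes --show-labels
kubectl get nodes -l ollama-charlie-key=ollama-charlie-value   # should list the g4dn.xlarge node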
But there is a problem: when I run kubectl apply -f inference.yaml, the pod does not come up and I get the following error:
Back-off restarting failed container download-llm in pod ollama-charlie-7745b595ff-5ldxt_default(57c6bba9-7d92-4cf8-a4ef-3b19f19023e4)
To diagnose it, when I do kubectl logs <pod_name> -c download-llm, I get:
Error: could not connect to ollama app, is it running?
This means that the Ollama service is not getting started. Could anyone help me figure out why, and edit the inference.yaml accordingly?
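For reference, the pod's scheduling and restart events can also be inspected (using the same <pod_name> placeholder as above):
kubectl describe pod <pod_name>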
P.S.: Earlier, I tried with the following spec in inference.yaml:
spec:
  initContainers:
    - name: download-llm
      image: ollama/ollama
      command: ["ollama", "run", "kristada673/solar-10.7b-instruct-v1.0-uncensored"]
      volumeMounts:
        - name: data
          mountPath: /root/.ollama
  containers:
    - name: ollama-charlie
      image: ollama/ollama
      volumeMounts:
        - name: data
          mountPath: /root/.ollama
      resources:
        limits:
          nvidia.com/gpu: 1
Here I did not specify the node group I created; I simply asked for a generic NVIDIA GPU. That gave me a different error, which is why I moved to specifying the key-value pair for the node group I created specifically for this deployment and removed the generic NVIDIA GPU request.
I just went through the same thing while adding support for operating Ollama servers in the KubeAI project. Here is what I found:
The ollama CLI behaves a little differently when you run it inside a Docker container. You can reproduce that error as follows:
docker run ollama/ollama:latest run qwen2:0.5b
Error: could not connect to ollama app, is it running?
When you execute ollama run outside of Docker, it appears to first start up an HTTP API, and then the CLI sends requests to that API. When you run ollama run inside the Docker container, it assumes that the server is already running (hence the could not connect part of the error). What you actually want in your case is to just serve that HTTP API. The ollama serve command will do that for you. It turns out that serve is the default command specified in the Dockerfile: https://github.com/ollama/ollama/blob/1c70a00f716ed61c5b0a9e0f2a01876de0fc54d0/Dockerfile#L217
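You can also see the difference by starting the container without any command override, in which case that default serve runs and the HTTP API comes up (a quick local check, assuming port 11434 is free):
docker run -d -p 11434:11434 --name ollama-test ollama/ollama:latest   # no command after the image, so the default serve runs
curl http://localhost:11434   # should respond with "Ollama is running"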
To resolve your error, you just need to get rid of the command: part of your Deployment (in your manifest that means dropping the download-llm init container, since it exists only to run that command). This will allow ollama to start up and serve traffic.
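As a rough sketch, the pod template from your manifest would then look something like this (no command: and no init container, so the image's default serve runs; the liveness probe and GPU settings are omitted here for brevity):
spec:
  nodeSelector:
    ollama-charlie-key: ollama-charlie-value
  containers:
    - name: ollama-charlie
      image: ollama/ollama
      ports:
        - containerPort: 11434
      volumeMounts:
        - name: data
          mountPath: /root/.ollama
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: ollama-charlie-pvc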
The model will be pulled in and served when clients connect to the ollama Deployment (via your k8s Service), either via a curl command or by running OLLAMA_HOST=<service-name>:<service-port> ollama run <your-model> from another Pod in your cluster.
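For example, a sketch of the curl route from another Pod, using the Service name and port from your manifest and the model name from your question:
# Pull the model through the Ollama HTTP API exposed by the Service
curl http://ollama-charlie-service:11434/api/pull -d '{"name": "kristada673/solar-10.7b-instruct-v1.0-uncensored"}'
# Then send a prompt
curl http://ollama-charlie-service:11434/api/generate -d '{"model": "kristada673/solar-10.7b-instruct-v1.0-uncensored", "prompt": "Hello"}'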