The Python program below is intended to translate large English texts into French. I use a for loop to feed a series of reports into Ollama.
from functools import cached_property

from ollama import Client


class TestOllama:
    @cached_property
    def ollama_client(self) -> Client:
        return Client(host="http://127.0.0.1:11434")

    def translate(self, text_to_translate: str):
        ollama_response = self.ollama_client.generate(
            model="mistral",
            prompt=f"translate this English text into French: {text_to_translate}"
        )
        return ollama_response['response'].lstrip(), ollama_response['total_duration']

    def run(self):
        # Placeholder report texts; average size per report is 750-1000 tokens.
        reports = ["reports_text_1", "reports_text_2", ...]
        for each_report in reports:
            try:
                translated_report, total_duration = self.translate(
                    text_to_translate=each_report
                )
                print(f"Translated text: {translated_report}, Time taken: {total_duration}")
            except Exception:
                pass  # failures are silently skipped


if __name__ == '__main__':
    job = TestOllama()
    job.run()
Docker command used to run Ollama:
docker run -d --gpus=all --network=host --security-opt seccomp=unconfined -v report_translation_ollama:/root/.ollama --name ollama ollama/ollama
My question is: when I run this script on a V100 and on an H100, I don't see a significant difference in execution time. I avoided client-side parallelism, thinking that Ollama might process requests in parallel internally. However, when I check with htop, I see only one core being used. Is my understanding correct?

I am a beginner in NLP, so any guidance on how to organize my code (e.g., using multithreading to send Ollama requests) would be appreciated.
The flags OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS were added in v0.1.33. You can set them when starting the Ollama server:
OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4 ollama serve
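Since you start Ollama in Docker (as in the docker run command from your question), the same variables can be passed to the container with -e. A sketch based on your command; the values of 4 are just an example:

docker run -d --gpus=all --network=host --security-opt seccomp=unconfined \
  -e OLLAMA_NUM_PARALLEL=4 -e OLLAMA_MAX_LOADED_MODELS=4 \
  -v report_translation_ollama:/root/.ollama --name ollama ollama/ollama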
OLLAMA_MAX_LOADED_MODELS - The maximum number of models that can be loaded concurrently, provided they fit in available memory. The default is 3 * the number of GPUs, or 3 for CPU inference.
OLLAMA_NUM_PARALLEL - The maximum number of parallel requests each model will process at the same time. The default auto-selects either 4 or 1 based on available memory.
OLLAMA_MAX_QUEUE - The maximum number of requests Ollama will queue when busy before rejecting additional requests. The default is 512.
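With OLLAMA_NUM_PARALLEL set, the server can process several requests at once, but your loop still sends one request at a time, so you also need client-side concurrency to benefit. Below is a minimal sketch using a thread pool; the host, model, and prompt mirror your code, and max_workers=4 is an assumption that should stay at or below OLLAMA_NUM_PARALLEL:

from concurrent.futures import ThreadPoolExecutor, as_completed

from ollama import Client

client = Client(host="http://127.0.0.1:11434")


def translate(text_to_translate: str):
    # Each call blocks until Ollama answers; running several calls in threads
    # lets the server handle them concurrently when OLLAMA_NUM_PARALLEL > 1.
    ollama_response = client.generate(
        model="mistral",
        prompt=f"translate this English text into French: {text_to_translate}"
    )
    return ollama_response['response'].lstrip(), ollama_response['total_duration']


def run(reports):
    # max_workers is an assumption; keeping it <= OLLAMA_NUM_PARALLEL avoids
    # requests simply waiting in Ollama's queue.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(translate, report): report for report in reports}
        for future in as_completed(futures):
            try:
                translated_report, total_duration = future.result()
                print(f"Translated text: {translated_report}, Time taken: {total_duration}")
            except Exception as exc:
                print(f"Request failed: {exc}")


if __name__ == '__main__':
    run(["reports_text_1", "reports_text_2"])

Note that this only helps throughput across reports; a single request still runs on one model instance, which is also why you see little difference between the V100 and the H100 for sequential calls.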