I have been playing around with oobabooga text-generation-webui on Ubuntu 20.04 with an NVIDIA GTX 1060 6GB for a few weeks without problems. I have been using llama2-chat models, sharing memory between my RAM and NVIDIA VRAM. I installed it without much trouble by following the instructions on its repository.
So what I want now is to use the llama-cpp model loader through its llama-cpp-python bindings and play around with it by myself. Using the same miniconda3 environment that oobabooga text-generation-webui uses, I started a Jupyter notebook and I can make inferences, and everything works well, BUT ONLY on CPU.
A working example below:
from llama_cpp import Llama
llm = Llama(model_path="/mnt/LxData/llama.cpp/models/meta-llama2/llama-2-7b-chat/ggml-model-q4_0.bin",
            n_gpu_layers=32, n_threads=6, n_ctx=3584, n_batch=521, verbose=True)
prompt = """[INST] <<SYS>>
Name the planets in the solar system?
<</SYS>>
[/INST]
"""
output = llm(prompt, max_tokens=350, echo=True)
print(output['choices'][0]['text'].split('[/INST]')[-1])
Of course! Here are the eight planets in our solar system, listed in order from closest to farthest from the Sun:
- Mercury
- Venus
- Earth
- Mars
- Jupiter
- Saturn
- Uranus
- Neptune
Note that Pluto was previously considered a planet but is now classified as a dwarf planet due to its small size and unique orbit.
I want to make inference using the GPU as well. What is wrong? Why can't I offload to the GPU as the parameter n_gpu_layers=32 specifies, and as oobabooga text-generation-webui already does in the same miniconda environment without any problems?
After searching around and struggling quite a bit for three weeks, I found this issue on its repository.
llama-cpp-python needs to know where the libllama.so shared library is. So exporting its location before running my Python interpreter, Jupyter notebook, etc. did the trick.
To use the miniconda3 installation used by oobabooga text-generation-webui, I exported it as below:
export LLAMA_CPP_LIB=/yourminicondapath/miniconda3/lib/python3.10/site-packages/llama_cpp_cuda/libllama.so
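Alternatively (a minimal sketch, assuming the same llama_cpp_cuda path as in the export above), the variable can be set from Python itself, as long as that happens before llama_cpp is imported, since the shared library is resolved at import time:
import os
# Must be set before the import below; llama-cpp-python reads LLAMA_CPP_LIB when it loads libllama.so
os.environ["LLAMA_CPP_LIB"] = "/yourminicondapath/miniconda3/lib/python3.10/site-packages/llama_cpp_cuda/libllama.so"
from llama_cpp import Llama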
On importing with from llama_cpp import Llama I get:
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1060, compute capability 6.1
And on
llm = Llama(model_path="/mnt/LxData/llama.cpp/models/meta-llama2/llama-2-7b-chat/ggml-model-q4_0.bin",
            n_gpu_layers=28, n_threads=6, n_ctx=3584, n_batch=521, verbose=True)
...
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 2381.32 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 480 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 28 repeating layers to GPU
llama_model_load_internal: offloaded 28/35 layers to GPU
llama_model_load_internal: total VRAM used: 3521 MB
...