python · nlp · gpu · huggingface-transformers · huggingface-tokenizers

How to run an NLP/Transformers LLM on low-memory GPUs?


I am trying to load a pre-trained AI model from Intel on Hugging Face. I tried Colab and exceeded its resources, tried Kaggle and ran into the same resource limits, and then used Paperspace, which shows me this error:

The kernel for Text_Generation.ipynb appears to have died. It will restart automatically.

This is the model-loading script:

import transformers


model_name = 'Intel/neural-chat-7b-v3-1'
model = transformers.AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

def generate_response(system_input, user_input):

    # Format the input using the provided template
    prompt = f"### System:\n{system_input}\n### User:\n{user_input}\n### Assistant:\n"

    # Tokenize and encode the prompt
    inputs = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=False)

    # Generate a response
    outputs = model.generate(inputs, max_length=1000, num_return_sequences=1)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract only the assistant's response
    return response.split("### Assistant:\n")[-1]


# Example usage
system_input = "You are a math expert assistant. Your mission is to help users understand and solve various math problems. You should provide step-by-step solutions, explain reasonings and give the correct answer."
user_input = "calculate 100 + 520 + 60"
response = generate_response(system_input, user_input)
print(response)

# expected response
"""
To calculate the sum of 100, 520, and 60, we will follow these steps:

1. Add the first two numbers: 100 + 520
2. Add the result from step 1 to the third number: (100 + 520) + 60

Step 1: Add 100 and 520
100 + 520 = 620

Step 2: Add the result from step 1 to the third number (60)
(620) + 60 = 680

So, the sum of 100, 520, and 60 is 680.
"""

My goal is to load this pre-trained model. I have done some research on my end and found some suggested solutions, but they are not working for me, for example:

downloading packages built for CUDA instead of the default pip packages


Solution

  • I would recommend looking into model quantization, as it is one of the approaches that specifically addresses this type of problem: loading a large model for inference on limited GPU memory. (An alternative quantization route is sketched after the code below.)

    TheBloke has provided a quantized version of this model, available here: neural-chat-7B-v3-1-AWQ. To use it, you'll need AutoAWQ, and, as Hugging Face notes in this notebook, on Colab you have to install an earlier AutoAWQ release to match Colab's CUDA version.

    You should also make sure inference runs on the GPU, not the CPU, by adding .cuda() to move the input tensor to the GPU after it is created:

    !pip install -q transformers accelerate
    !pip install -q -U https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.6/autoawq-0.1.6+cu118-cp310-cp310-linux_x86_64.whl
    
    import torch
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer
    
    model_name = 'TheBloke/neural-chat-7B-v3-1-AWQ'
    ### Use AutoAWQ's from_quantized instead of transformers' from_pretrained here
    model = AutoAWQForCausalLM.from_quantized(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    def generate_response(system_input, user_input):
    
        # Format the input using the provided template
        prompt = f"### System:\n{system_input}\n### User:\n{user_input}\n### Assistant:\n"
    
        ### ADD .cuda() to move the inputs to the GPU
        inputs = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=False).cuda()
    
        # Generate a response
        outputs = model.generate(inputs, max_length=1000, num_return_sequences=1)
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
        # Extract only the assistant's response
        return response.split("### Assistant:\n")[-1]
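
    As an alternative to the AWQ route, you can also quantize the original Intel/neural-chat-7b-v3-1 weights on load with bitsandbytes through transformers. This is only a minimal sketch, not part of the answer above: it assumes bitsandbytes and accelerate are installed and a CUDA GPU is available; 4-bit NF4 loading should shrink a 7B model's weights to a few GB of VRAM.

    !pip install -q transformers accelerate bitsandbytes

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_name = 'Intel/neural-chat-7b-v3-1'

    # 4-bit NF4 quantization config; the weights are quantized while loading
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )

    # device_map="auto" places the quantized weights on the available GPU
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    The generate_response function above should work unchanged with this model; the only adjustment is to move the inputs with .to(model.device) instead of .cuda(), which stays correct even if device_map places parts of the model elsewhere:

    inputs = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)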