I am trying to load a pre-trained AI model from Intel on Hugging Face. I have used Colab (its resources were exceeded), Kaggle (even with increased resources), and Paperspace, which shows me this error:
The kernel for Text_Generation.ipynb appears to have died. It will restart automatically.
This is the model loading script:
import transformers
model_name = 'Intel/neural-chat-7b-v3-1'
model = transformers.AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
def generate_response(system_input, user_input):
    # Format the input using the provided template
    prompt = f"### System:\n{system_input}\n### User:\n{user_input}\n### Assistant:\n"

    # Tokenize and encode the prompt
    inputs = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=False)

    # Generate a response
    outputs = model.generate(inputs, max_length=1000, num_return_sequences=1)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract only the assistant's response
    return response.split("### Assistant:\n")[-1]
# Example usage
system_input = "You are a math expert assistant. Your mission is to help users understand and solve various math problems. You should provide step-by-step solutions, explain reasonings and give the correct answer."
user_input = "calculate 100 + 520 + 60"
response = generate_response(system_input, user_input)
print(response)
# expected response
"""
To calculate the sum of 100, 520, and 60, we will follow these steps:
1. Add the first two numbers: 100 + 520
2. Add the result from step 1 to the third number: (100 + 520) + 60
Step 1: Add 100 and 520
100 + 520 = 620
Step 2: Add the result from step 1 to the third number (60)
(620) + 60 = 680
So, the sum of 100, 520, and 60 is 680.
"""
My goal is simply to load this pretrained model. I have done some research on my end and found a few suggested solutions, but none of them work for me, e.g. installing CUDA-enabled builds of the packages instead of the default pip packages.
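A quick memory check like the following (a sketch; it assumes psutil is installed and, optionally, a CUDA GPU) shows how much room the runtime actually has before loading:

import psutil
import torch

# A 7B-parameter model in float32 needs roughly 28 GB just for the weights
print(f"Available RAM: {psutil.virtual_memory().available / 1e9:.1f} GB")

# Report free GPU memory if a CUDA device is present
if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"Free GPU memory: {free_bytes / 1e9:.1f} GB of {total_bytes / 1e9:.1f} GB")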
I would recommend looking into model quantization, as it is one of the approaches that specifically addresses this type of problem: loading a large model for inference with limited memory.
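If you want to stay with plain transformers, on-the-fly 4-bit loading through bitsandbytes is one way to do this (a minimal sketch, assuming a CUDA GPU runtime with the bitsandbytes and accelerate packages installed; the settings are illustrative, not tuned):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = 'Intel/neural-chat-7b-v3-1'

# Quantize the weights to 4 bits while loading, cutting memory use roughly 8x vs float32
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on the GPU and spill to CPU if needed
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Note that this still downloads the full-precision weights first; using a checkpoint that has already been quantized, as below, avoids that.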
For a ready-made quantized checkpoint, TheBloke has provided an AWQ version of this model, available here: neural-chat-7B-v3-1-AWQ. To use it, you'll need AutoAWQ, and, as per Hugging Face in this notebook, on Colab you need to install an earlier AutoAWQ release to match Colab's CUDA version.
You should also make sure the model runs on the GPU rather than the CPU, by calling .cuda() on the input tensor after it is created:
!pip install -q transformers accelerate
!pip install -q -U https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.6/autoawq-0.1.6+cu118-cp310-cp310-linux_x86_64.whl
import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_name = 'TheBloke/neural-chat-7B-v3-1-AWQ'
### Use AutoAWQ and from_quantized instead of transformers here
model = AutoAWQForCausalLM.from_quantized(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
def generate_response(system_input, user_input):
    # Format the input using the provided template
    prompt = f"### System:\n{system_input}\n### User:\n{user_input}\n### Assistant:\n"

    ### ADD .cuda()
    inputs = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=False).cuda()

    # Generate a response
    outputs = model.generate(inputs, max_length=1000, num_return_sequences=1)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract only the assistant's response
    return response.split("### Assistant:\n")[-1]
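You can then call it exactly as before, for example with the same prompt as in your question:

system_input = "You are a math expert assistant. Your mission is to help users understand and solve various math problems. You should provide step-by-step solutions, explain reasonings and give the correct answer."
user_input = "calculate 100 + 520 + 60"
response = generate_response(system_input, user_input)
print(response)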