I'm trying to use Llama 2 chat (7B parameters, via Hugging Face) in Google Colab (Python 3.10.12). I've already obtained my access token from Meta, and I'm simply using the example code from the Hugging Face model card together with my token. Here is my code:
!pip install transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
token = "---Token copied from Hugging Face and pasted here---"
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", token=token)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", token=token)
It starts downloading the model, but when it reaches "Loading checkpoint shards:" it just stops running, with no error:
The issue is the Colab instance running out of RAM. Based on your comments, you are using a basic Colab instance with 12.7 GB of CPU RAM.
For the Llama 2 7B model you'll need roughly 28 GB of RAM just for the weights in float32 (7 billion parameters × 4 bytes), or about 14 GB in float16, which is well beyond the 12.7 GB available.
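As a rough back-of-the-envelope check of the weight memory alone (ignoring activations and the KV cache):

# Rough weight-only memory estimate for a 7B-parameter model
params = 7_000_000_000
for name, bytes_per_param in [("float32", 4), ("float16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{name}: ~{params * bytes_per_param / 1e9:.1f} GB")
# float32: ~28.0 GB, float16: ~14.0 GB, int8: ~7.0 GB, int4: ~3.5 GB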
Check this link for the details on the required resources: huggingface.co/NousResearch/Llama-2-7b-chat-hf/discussions/3
Also, if you only want to run inference (predictions) with the model, I would recommend using its quantized 4-bit or 8-bit versions. Both can be run on CPU and don't need a lot of memory.
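For example, here is a minimal sketch of loading the model in 4-bit through transformers and bitsandbytes. Note that this particular route expects a GPU runtime (Colab's free T4 is enough); fully CPU-based quantized inference is usually done with llama.cpp-style GGML/GGUF builds instead. The sketch assumes bitsandbytes and accelerate are installed (`!pip install transformers accelerate bitsandbytes`) and reuses the same access token as above.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

token = "---Token copied from Hugging Face and pasted here---"
model_id = "meta-llama/Llama-2-7b-chat-hf"

# Quantize the weights to 4-bit on load; matmuls are computed in fp16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, token=token)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",   # let accelerate place the layers on the GPU
    token=token,
)

# Quick sanity check with the Llama 2 chat prompt format
prompt = "[INST] What is the capital of France? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

In 4-bit the weights take roughly 3.5 GB, so the model fits comfortably in the free-tier GPU memory.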