I am facing an issue with the Llama 2-7B model where the output is consistently limited to only 511 tokens, even though the model should theoretically be capable of producing outputs up to a maximum of 4096 tokens.
I’ve tried setting the max_tokens parameter to higher values, such as 3000, and have calculated the available tokens by subtracting the prompt tokens from the model’s total token limit (4096 tokens). However, despite these adjustments, I continue to receive outputs capped at 511 tokens.
Here’s a snippet of the code I am using to interact with the model:
import psutil
import os
import warnings
from llama_cpp import Llama
# Suppress warnings
warnings.filterwarnings("ignore")
# Path to the model
model_path = "C:/Llama_project/models/llama-2-7b-chat.Q2_K.gguf"
# Load the model
llm = Llama(model_path=model_path)
# System message to set the behavior of the assistant
system_message = "You are a helpful assistant."
# Function to ask questions
def ask_question(question):
    # Use user input for the question prompt
    prompt = f"Answer the following question: {question}"

    # Calculate the remaining tokens for output based on the model's 4096 token limit
    prompt_tokens = len(prompt.split())  # Rough token count estimate
    max_output_tokens = 4096 - prompt_tokens  # Tokens left for output

    # Monitor memory usage before calling the model
    process = psutil.Process(os.getpid())
    mem_before = process.memory_info().rss / 1024 ** 2  # Memory in MB

    # Get the output from the model with the calculated max tokens for output
    output = llm(prompt=prompt, max_tokens=max_output_tokens, temperature=0.7, top_p=1.0)

    # Monitor memory usage after calling the model
    mem_after = process.memory_info().rss / 1024 ** 2  # Memory in MB

    # Clean the output and return only the answer text
    return output["choices"][0]["text"].strip()
# Main loop for user interaction
while True:
    user_input = input("Ask a question (or type 'exit' to quit): ")
    if user_input.lower() == 'exit':
        print("Exiting the program.")
        break

    # Get the model's response
    answer = ask_question(user_input)

    # Print only the answer
    print(f"Answer: {answer}")
Problem Details:
I would expect the model to generate responses close to the token limit (ideally 3000 tokens or more, depending on the input), but the output is always capped at 511 tokens.
Tried:
Setting max_tokens to higher values (e.g. 3000) and computing the available output tokens as 4096 minus the prompt token count, as shown in the code above. Neither changed the 511-token cap.
Try setting n_ctx to 2048 (or higher) in the Llama constructor, like so:
Llama(n_ctx=2048, model_path=model_path)
This parameter tells the model the maximum combined length of the prompt and the response. In llama-cpp-python the default context size is 512 tokens, which is why your output is capped at roughly 511 tokens no matter how large you set max_tokens.
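As a rough sketch (untested against your exact setup), you could load the model with the full 4096-token context and count prompt tokens with the model's own tokenizer rather than split(), which undercounts because a word is often several tokens. The llm.tokenize() and llm.n_ctx() calls below are the llama-cpp-python methods assumed here:

import warnings
from llama_cpp import Llama

warnings.filterwarnings("ignore")

model_path = "C:/Llama_project/models/llama-2-7b-chat.Q2_K.gguf"

# Load with a context window large enough for long answers
# (Llama 2 supports up to 4096 tokens of prompt + response combined)
llm = Llama(model_path=model_path, n_ctx=4096)

def ask_question(question):
    prompt = f"Answer the following question: {question}"

    # Count prompt tokens with the model's tokenizer (bytes in, token ids out)
    prompt_tokens = len(llm.tokenize(prompt.encode("utf-8")))

    # Leave the rest of the context window for the answer,
    # with a small margin for special tokens such as BOS
    max_output_tokens = llm.n_ctx() - prompt_tokens - 8

    output = llm(prompt=prompt, max_tokens=max_output_tokens,
                 temperature=0.7, top_p=1.0)
    return output["choices"][0]["text"].strip()

Note that even with a 4096-token context, generation can still end before the max_tokens limit whenever the model emits its end-of-sequence token, so answers will not necessarily reach 3000 tokens just because the budget allows it.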