Tags: python, artificial-intelligence, huggingface-transformers, large-language-model

Llama-2 7B-hf repeats context of question directly from input prompt, cuts off with newlines


Context: I am trying to query Llama-2 7B from Hugging Face (meta-llama/Llama-2-7b-hf). I give it a question and context (anywhere from roughly 200-1000 tokens) and ask it to answer the question based on that context, which is retrieved from a vector store using similarity search. Here are my two problems:

  1. The answer ends, and the rest of the tokens until max_new_tokens is reached are all newlines. Or it doesn't generate any text at all and the entire response is newlines. Adding a repetition_penalty of 1.1 or greater has stopped the infinite newline generation, but still does not get me full answers.
  2. Answers that do generate are copied word for word from the given context. This remains the same with repetition_penalty=1.1, and making the repetition penalty too high turns the answer into nonsense.

I have only tried temperature=0.4 and temperature=0.8, but from what I have seen so far, tuning temperature and repetition_penalty both result in either the context being copied verbatim or a nonsensical answer.
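For reference, the setup looks roughly like this (a simplified sketch; the actual prompt template, loading options, and any parameter values not mentioned above are assumptions):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

query = "Summarize Topic X"
context = "...retrieved chunks joined with newlines..."  # see the example below

# Question and retrieved context joined into a single prompt (assumed format)
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=150,
    do_sample=True,
    temperature=0.4,          # also tried 0.8
    repetition_penalty=1.1,   # stops the infinite newlines, not the copying
)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))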

Note about the "context": I am using a document stored in a Chroma vector store, and similarity search retrieves the relevant information before I pass it to Llama.
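In case it matters, the retrieval step is roughly the following (sketch only; I am showing the plain chromadb client API here, and the collection setup is simplified):

import chromadb

# In-memory client; the real setup (persistence, embedding function) is more involved
client = chromadb.Client()
collection = client.get_or_create_collection("docs")

# Document chunks are added ahead of time
collection.add(
    documents=["chunk 1 text...", "chunk 2 text...", "chunk 3 text..."],
    ids=["chunk-1", "chunk-2", "chunk-3"],
)

# Similarity search for the query; the top results become the context passed to Llama
results = collection.query(query_texts=["Summarize Topic X"], n_results=3)
context = "\n".join(results["documents"][0])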

Example Problem: My query is to summarize a certain Topic X.

query = "Summarize Topic X"

The context retrieved from the vector store consists of 3 sources that look something like this (I separate the sources with newlines when formatting my query to the LLM):

context = """When talking about Topic X, Scenario Y is always referred to. This is due to the relation of
Topic X is a broad topic which covers many aspects of life.
No one knows when Topic X became a thing, its origin is unknown even to this day."""

Then the response from Llama-2 directly mirrors one piece of context and includes no information from the others. Furthermore, it produces many newlines after the answer: if the answer is 100 tokens and max_new_tokens is 150, I get 50 newlines.

response = "When talking about Topic X, Scenario Y is always referred to. This is due to the relation of \n\n\n\n"

One of my biggest issues is that, in addition to copying a single piece of context, the LLM response also cuts off mid-sentence whenever the copied context does.


Is anyone else experiencing anything like this (newline issue or copying part of your input prompt)? Has anyone found a solution?


Solution

  • This is a common issue with pre-trained base models like Llama.

    My first thought would be to switch to a model that has had some form of instruction tuning, e.g. https://huggingface.co/meta-llama/Llama-2-7b-chat. Instruction tuning makes the model much better at following task instructions reliably, whereas the base model is only trained to predict the next token, which is often why the cutoff happens (see the first sketch at the end of this answer).

    The second thing that, in my experience, has helped is using the same prompt format that was used during training. You can see the prompt format Meta used for training and generation in the source code. Here is a thread about it. The first sketch below also shows the format.

    Finally, using a logits processor at generation time has been helpful for cutting down on repetition (second sketch below).
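
    For the first two points, a rough sketch (untested; the chat checkpoint on the Hub is meta-llama/Llama-2-7b-chat-hf, and the system prompt and generation settings here are just placeholders):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-7b-chat-hf"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    query = "Summarize Topic X"
    context = "...retrieved chunks joined with newlines..."

    # Llama-2 chat prompt format used by Meta during fine-tuning
    # (the tokenizer adds the leading <s> BOS token itself)
    system = "Answer the question using only the provided context."
    user = f"Context:\n{context}\n\nQuestion: {query}"
    prompt = f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.4)
    print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))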
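
    For the repetition point, generate() accepts an explicit LogitsProcessorList, so you can combine a mild repetition penalty with an n-gram blocker (values are illustrative; model, tokenizer, and inputs are the ones from the sketch above):

    from transformers import (
        LogitsProcessorList,
        NoRepeatNGramLogitsProcessor,
        RepetitionPenaltyLogitsProcessor,
    )

    processors = LogitsProcessorList([
        RepetitionPenaltyLogitsProcessor(penalty=1.15),  # mild; too high makes the output nonsense
        NoRepeatNGramLogitsProcessor(ngram_size=4),      # block verbatim 4-gram repeats
    ])

    output = model.generate(**inputs, max_new_tokens=256, logits_processor=processors)
    print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))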