I just installed raw llama.cpp to run codellama-7b-instruct.Q5_K_M.gguf. I started it with llama.cpp's server, but unfortunately it is responding with really weird answers; it looks like it is trying to simulate its own conversation.
I have tried using a template like this one:
[Instruction: You are an expert assistant. Always provide direct, concise answers.]
USER: What is 2 + 2?
ASSISTANT: 4.
USER: How to use console.log in JS?
ASSISTANT:
END OF CONVERSATION
But this works poorly, and I realized that I need the /chat endpoint instead of the currently used /completion. If anyone has some kind of "extension" to llama.cpp so I can use the /chat endpoint, or knows another way to use /chat in llama.cpp, please let me know.
Excuse my English, I'm still learning.
If you need any other info, just ask in the comments!
It's most likely working poorly because you're not using the correct chat template for Code LLaMA, which is a slightly modified version of LLaMA 2's chat template:
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>
{{ user_message_1 }} [/INST] {{ model_answer_1 }} </s>
<s>[INST] {{ user_message_2 }} [/INST]
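If you keep using the raw /completion endpoint, the prompt you send has to be formatted in exactly that way. A rough sketch with curl (assuming the server runs on localhost:8080; I've left out the literal <s>, since the server usually adds the BOS token itself when tokenizing):

# Hypothetical request against /completion with a manually formatted Code Llama prompt
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "[INST] <<SYS>>\nYou are an expert assistant. Always provide direct, concise answers.\n<</SYS>>\n\nHow to use console.log in JS? [/INST]",
    "n_predict": 256
  }'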
According to the llama.cpp README, the chat completion endpoint is already supported. There's no need for an "extension".
llama-server -m model.gguf --port 8080
# Basic web UI can be accessed via browser: http://localhost:8080
# Chat completion endpoint: http://localhost:8080/v1/chat/completions
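That endpoint is OpenAI-compatible: you POST your messages as JSON and the server applies the chat template for you. A minimal sketch with curl (as far as I know the "model" field is ignored by llama.cpp, since it serves whatever model it loaded, but some clients expect it to be present):

# Chat request against the local server; the roles are mapped onto the chat template automatically
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "codellama-7b-instruct",
    "messages": [
      {"role": "system", "content": "You are an expert assistant. Always provide direct, concise answers."},
      {"role": "user", "content": "How to use console.log in JS?"}
    ]
  }'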
If supported, the appropriate chat template will be selected for your model once it's loaded. If you wish to specify a particular template, you may do so with the --chat-template flag (e.g. llama-server -m codellama-7b-instruct.Q5_K_M.gguf -ngl 64 -c 0 --chat-template llama2).
Some models use old or unusual chat templates. For those, you'd use the --jinja and --chat-template-file flags along with a custom Jinja chat template file (supported since b4524), or use the older completion API and parse the outputs accordingly.
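That would look something like this (the template filename is made up; point it at your own Jinja file):

# Hypothetical invocation with a custom Jinja chat template
llama-server -m codellama-7b-instruct.Q5_K_M.gguf --jinja --chat-template-file my-template.jinja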