I just installed raw llama.cpp to run codellama-7b-instruct.Q5_K_M.gguf. I started it with llama.cpp's server, but unfortunately it is responding with really weird answers; it looks like it is trying to simulate its own conversation.
I have tried using a template like this one:
[Instruction: You are an expert assistant. Always provide direct, concise answers.]
USER: What is 2 + 2?
ASSISTANT: 4.
USER: How to use console.log in JS?
ASSISTANT:
END OF CONVERSATION
But this works poorly, and I realized that I need the /chat endpoint instead of the currently used /completion. If anyone has some kind of "extension" to llama.cpp so I can use the /chat endpoint, or knows another way to use /chat in llama.cpp, please let me know.
Excuse my English, I'm still learning.
If you need any other info, just ask in the comments!
It's most likely working poorly because you're not using the correct chat template for Code LLaMA, which is a slightly modified version of LLaMA 2's chat template:
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>
{{ user_message_1 }} [/INST] {{ model_answer_1 }} </s>
<s>[INST] {{ user_message_2 }} [/INST]
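If you keep using the raw /completion endpoint, the prompt you send has to be formatted in exactly that way. A rough sketch with curl (assuming the server runs on localhost:8080; I've left out the literal <s>, since the server usually adds the BOS token itself when tokenizing):

# Hypothetical request against /completion with a manually formatted Code Llama prompt
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "[INST] <<SYS>>\nYou are an expert assistant. Always provide direct, concise answers.\n<</SYS>>\n\nHow to use console.log in JS? [/INST]",
    "n_predict": 256
  }'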
According to the llama.cpp README, the chat completion endpoint is already supported. There's no need for an "extension".
llama-server -m model.gguf --port 8080
# Basic web UI can be accessed via browser: http://localhost:8080
# Chat completion endpoint: http://localhost:8080/v1/chat/completions
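That endpoint is OpenAI-compatible: you POST your messages as JSON and the server applies the chat template for you. A minimal sketch with curl (as far as I know the "model" field is ignored by llama.cpp, since it serves whatever model it loaded, but some clients expect it to be present):

# Chat request against the local server; the roles are mapped onto the chat template automatically
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "codellama-7b-instruct",
    "messages": [
      {"role": "system", "content": "You are an expert assistant. Always provide direct, concise answers."},
      {"role": "user", "content": "How to use console.log in JS?"}
    ]
  }'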
If supported, the appropriate chat template will be selected for your model once it's loaded. If you wish to specify a particular template, you may do so with the --chat-template flag (e.g. llama-server -m codellama-7b-instruct.Q5_K_M.gguf -ngl 64 -c 0 --chat-template llama2).
Some models use old or unusual chat templates. For those, you'd use the --jinja and --chat-template-file flags along with a custom Jinja chat template file (supported since b4524), or use the older completion API and parse the outputs accordingly.
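That would look something like this (the template filename is made up; point it at your own Jinja file):

# Hypothetical invocation with a custom Jinja chat template
llama-server -m codellama-7b-instruct.Q5_K_M.gguf --jinja --chat-template-file my-template.jinja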