artificial-intelligence, chatbot, endpoint, large-language-model, llamacpp

How to create a prompt with the /chat endpoint for llama.cpp?


I just installed raw llama.cpp to run codellama-7b-instruct.Q5_K_M.gguf. I started it with llama.cpp's server, but unfortunately it responds with really weird answers; it looks like it is trying to simulate both sides of its own conversation.

I have tried using a template like this one:

    [Instruction: You are an expert assistant. Always provide direct, concise answers.]

    USER: What is 2 + 2?
    ASSISTANT: 4.

    USER: How to use console.log in JS?
    ASSISTANT:

    END OF CONVERSATION

But this works poorly, and I realized that I need the /chat endpoint instead of the /completion endpoint I'm currently using.

If anyone has some kind of "extension" to llama.cpp that would give me a /chat endpoint, or knows another way to use /chat with llama.cpp, please let me know.

Excuse my English, I'm still learning.
For any other info, ask in the comments!


Solution

  • It's most likely working poorly because you're not using the correct chat template for Code Llama, which is a slightly modified version of Llama 2's chat template:

    <s>[INST] <<SYS>>
    {{ system_prompt }}
    <</SYS>>
    {{ user_message_1 }} [/INST] {{ model_answer_1 }} </s>
    <s>[INST] {{ user_message_2 }} [/INST]
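
    For illustration, here is your example conversation rendered with that template (the system prompt and answer are filled in from your own example):

    <s>[INST] <<SYS>>
    You are an expert assistant. Always provide direct, concise answers.
    <</SYS>>
    What is 2 + 2? [/INST] 4. </s>
    <s>[INST] How to use console.log in JS? [/INST]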
    

    According to the llama.cpp README, the chat completion endpoint is already supported. There's no need for an "extension".

    llama-server -m model.gguf --port 8080
    # Basic web UI can be accessed via browser: http://localhost:8080
    # Chat completion endpoint: http://localhost:8080/v1/chat/completions
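
    As a minimal sketch (assuming the server is running locally on port 8080, as above), you can exercise the chat completion endpoint with curl; the request body follows the standard OpenAI-compatible chat schema, and the server renders the messages with the model's chat template for you:

    curl http://localhost:8080/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
            "messages": [
                {"role": "system", "content": "You are an expert assistant. Always provide direct, concise answers."},
                {"role": "user", "content": "How to use console.log in JS?"}
            ]
        }'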
    

    If your model's template is supported, the appropriate chat template will be selected automatically once the model is loaded. If you wish to specify a particular template, you can do so with the --chat-template flag (e.g. llama-server -m codellama-7b-instruct.Q5_K_M.gguf -ngl 64 -c 0 --chat-template llama2).

    Some models use old or unusual chat templates. For those, you'd use the --jinja and --chat-template-file flags along with a custom Jinja chat template file (supported since b4524), or use the older completion API and parse the outputs accordingly.
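
    For example (a sketch; the template filename here is a placeholder for your own Jinja file):

    # requires llama.cpp build b4524 or newer
    llama-server -m codellama-7b-instruct.Q5_K_M.gguf --jinja --chat-template-file my-template.jinja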