python, artificial-intelligence, large-language-model, ollama

Is there a way to manually set the first part of a model's response in Ollama?


A bit of context:

I recently got into tinkering with LLMs using the ollama Python package. My latest project is an AI assistant (how original, I know) with tool calling capabilities. I'm using the qwen3:4b model, and while it isn't the greatest at tool calling due to its size, it still gets the job done. One of the tools I wanted to integrate was a "file explorer" type of tool. The idea is that the model first calls an "open_file()" function, which lists all the files, and then generates a second tool call, "open_document(filename)", to open one of those files. So the model makes two tool calls: one to list the files, followed by another to open one. Note that for the second tool call, the "open_document(filename)" function is the only tool the model can choose from. When this works, it works flawlessly. The persistent problem, however, is that a lot of the time the model doesn't generate any tool call on the second turn, i.e. the call needed to actually open the document after the files have been listed.
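Here's roughly what the two-turn flow looks like with the ollama package (a trimmed-down sketch with placeholder tools and file names, not my actual project code; it assumes a recent ollama version where the response exposes message.tool_calls and where plain Python functions can be passed as tools):

    import ollama

    def open_file() -> str:
        """List the available files."""
        return "Available files, please select one:\n - about.txt\n - example.txt"

    def open_document(filename: str) -> str:
        """Open the named document."""
        return f"(contents of {filename})"

    messages = [{'role': 'user', 'content': 'Open the about file.'}]

    # First turn: the model is expected to call open_file() to list the files.
    first = ollama.chat(model='qwen3:4b', messages=messages, tools=[open_file])
    messages.append(first.message)
    if first.message.tool_calls:
        messages.append({'role': 'tool', 'content': open_file()})

    # Second turn: open_document is the only tool offered, yet the model
    # often replies in plain text instead of emitting the tool call.
    second = ollama.chat(model='qwen3:4b', messages=messages, tools=[open_document])
    print(second.message.tool_calls)  # frequently empty, which is the problem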

The Question:

As we know, an LLM generates its response one token at a time. Is there a way to get the model to start generating from some sort of starting point? In other words, is there a way to get the model to complete what's already there instead of generating its entire response from scratch?

Example:

Input:
"""
User: Open the about file.
Assistant: *open_file()
Tool: Available files, please select one:
 - file.txt
 - example.txt
 - about.txt
 - random.txt
Assistant: *open_document("
"""

LLM Output:
"""
about.txt")
"""

Solution

  • Sadly, it appears that the Ollama API does not support this kind of use of LLMs. However, I was able to achieve the result I wanted using Hugging Face transformers and PyTorch. It's not a pretty solution, and I really hope the Ollama team implements features such as forced tool calls and multiple-choice tools soon, but for now this will do:

    from transformers import AutoTokenizer, AutoModelForCausalLM
    import torch
    
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(device)
    
    model_name = 'Qwen/Qwen3-0.6B'
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
    
    def open_document(filename: str) -> str:
        """
        Opens the text document of name "filename".
        
        Example call:
        
        open_document("example.txt")
        Args:
            filename: The name of the document file
        Returns:
            str: The file contents
        """
        return f"Succesfully opened file \"{filename}\"."
    tools = [open_document]
    
    files_list = """
    File Explorer
    Available files, use the "*open_document()*" function to open only one:
     - about.txt
     - coding_paper.txt
     - system_requirements.txt
     - updates.txt"""
    messages = [{'role': 'user', 'content': "What's your latest update?"},
                {'role': 'tool', 'content': files_list},
                {'role': 'assistant', 'content': '<tool_call>\n{"name": "open_document", "arguments": {"filename": "'}]
    
    TOP_K=5
    MAX_NEW_TOKENS = 32768
    
    def finish_sentence(messages):
        # Drop the last two tokens (the trailing <|im_end|> and newline of the prefilled
        # assistant message) so the model continues that message instead of starting a new one.
        input_ids = tokenizer.apply_chat_template(messages, tools=tools, return_tensors='pt', padding=True, truncation=True).to(device)[:, :-2]
        print(tokenizer.decode(input_ids[0], skip_special_tokens=False), end='', flush=True)
        for _ in range(MAX_NEW_TOKENS):  # cap generation so the loop cannot run forever
            next_token, input_ids, eos = generate_next_token(input_ids)
            if eos:
                break
            print(next_token, end='', flush=True)
    
    def generate_next_token(input_ids):
        with torch.no_grad():
            model_output = model(input_ids)
        logits = model_output.logits
        next_token_logits = logits[0, -1, :]  # logits for the next position only
        probabilities = torch.softmax(next_token_logits, dim=-1)
        top_k_probs, top_k_indices = torch.topk(probabilities, TOP_K)
        predicted_token_id = top_k_indices[0].reshape(1,)  # greedy: take the single most likely token
        input_ids = torch.cat([input_ids, predicted_token_id.unsqueeze(0)], dim=-1)
        token = tokenizer.decode([predicted_token_id.item()])
        eos = predicted_token_id.item() == tokenizer.eos_token_id  # stop once the model emits end-of-sequence
        return token, input_ids, eos
    
    finish_sentence(messages)
    

    This is only proof-of-concept code, but it successfully completes the tool call argument. It works because the EOS token at the end of the last message is removed, allowing the model to continue where I left off and finish the tool call as if it had initiated it. Keep in mind that this code is not meant for actually running tool calls; it's just a proof of concept showing a forced tool call.
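
    If you do want to execute the forced call, one possible follow-up (again just a sketch on top of the code above; the 256-token cap and the parsing rely on Qwen's <tool_call> JSON format) is to accumulate the generated tokens instead of printing them, glue them back onto the prefilled prefix, and parse the result:

    import json

    def complete_tool_call(messages, max_new_tokens=256):
        """Like finish_sentence(), but returns the generated text instead of printing it."""
        input_ids = tokenizer.apply_chat_template(
            messages, tools=tools, return_tensors='pt',
            padding=True, truncation=True).to(device)[:, :-2]
        generated = ''
        for _ in range(max_new_tokens):
            token, input_ids, eos = generate_next_token(input_ids)
            if eos:
                break
            generated += token
        return generated

    # Rebuild the full tool-call JSON: the prefilled prefix plus the model's completion,
    # cutting off Qwen's closing </tool_call> tag before parsing.
    prefix = '{"name": "open_document", "arguments": {"filename": "'
    completion = complete_tool_call(messages)  # e.g. 'updates.txt"}}\n</tool_call>'
    call = json.loads((prefix + completion).split('</tool_call>')[0].strip())
    result = open_document(**call['arguments'])

    You would still want a try/except around json.loads in case the model closes the JSON in an unexpected way.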