go, web-scraping, large-language-model, ollama

Best ways to feed the Ollama LLM with a high data load


I am developing a chatbot for my university that will use a wiki with curriculum information for courses and other relevant data. One of the challenges is optimizing the use of Ollama to process the wiki content as context for user responses.

The issue is that the wiki is extremely extensive: initial tests estimate around 20,000 tokens for just a few pages, which can cause significant delays in Ollama's responses or even failures in understanding user requests.

Is there any documentation, methodology, or recommended approach for getting Ollama to handle large volumes of data efficiently? Or is this resolved only by processing power and the choice of model?

Currently, I am conducting tests on weaker models like gemma3:1b, but in production, we would have a much more powerful machine for this.

I am trying to include the HTML content in the message context, as it will be necessary for the chatbot to be aware of links to include in responses (obviously, this will be improved, and I will clean up unnecessary HTML data and tags).

// Inside the colly OnHTML callback: grab the raw HTML of the matched element
// (the error is ignored here for brevity).
content, _ := e.DOM.Html()

// Build the prompt: instructions first, then the page content between --- markers.
var contextBuilder strings.Builder
contextBuilder.WriteString(
    `Respond exclusively in Brazilian Portuguese. From the provided content below, extract and summarize the following information:
    1. Course curriculum;
    2. Available external activities;
    3. Information about the course coordination.
    Return the information clearly, organized, and concisely, using lists or bullet points when appropriate. Consider only the provided content:
    ---`)
contextBuilder.WriteString(content)
contextBuilder.WriteString("---")

// Send the whole prompt as a single user message to Ollama.
message, err := ollama_service.SendRequest(ollama_dto.Request{
    Model: os.Getenv("OLLAMA_MODEL"),
    Messages: []ollama_dto.Message{
        {Role: "user", Content: contextBuilder.String()},
    },
})
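
For the cleanup mentioned above, this rough sketch is roughly what I have in mind, using goquery (which is what e.DOM already is under colly, and assumes the github.com/PuerkitoBio/goquery import); the selector and link formatting are just illustrative:

// Keep only the visible text plus the links the chatbot will need, instead of raw HTML.
text := e.DOM.Text()

var links strings.Builder
e.DOM.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
    href, _ := s.Attr("href")
    // Resolve relative links against the page URL so the chatbot can cite them directly.
    links.WriteString(s.Text() + ": " + e.Request.AbsoluteURL(href) + "\n")
})
// text and links.String() together are far fewer tokens than the full HTML.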

Solution

  • Well... I think ChatGPT answers this type of question pretty well, but I'll give you a human response.

    You mentioned two of the disadvantages of putting too many tokens into LLMs:

    1. it's slow
    2. the LLM feels stupid

    It's slow because you don't have a good GPU. Makes sense. You may also not have enough VRAM for the tokens you're throwing at it. 20k tokens is pretty big, but not crazy.

    The LLM feels stupid because weaker models can struggle to maintain coherence, or collapse outright, when they see too much text. Better models can usually handle more. 1B is considered a pretty small model, so I wouldn't expect it to stay coherent at 20k tokens. If you give an LLM too much text, big models tend to forget the middle, and small models freak out. 1B is too small as of June 2025.

    You might be better off using an LLM API like OpenRouter or Gemini. If you connect to Ollama via the OpenAI SDK, you don't even have to change the code: swap the API key and base URL and you're good to go. If you use the Ollama SDK... well, changing the code is not that hard. Use the OpenAI SDK next time; the OpenAI chat completion API is basically the industry standard now.
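
    For example, here's a minimal sketch of that swap with the go-openai client, assuming Ollama's OpenAI-compatible endpoint at http://localhost:11434/v1 (the model name and URL are whatever you actually run):

    package main

    import (
        "context"
        "fmt"
        "os"

        openai "github.com/sashabaranov/go-openai"
    )

    func main() {
        // Point the OpenAI client at local Ollama; moving to a hosted
        // OpenAI-compatible provider later is just a different BaseURL and a real API key.
        cfg := openai.DefaultConfig("ollama") // Ollama ignores the key
        cfg.BaseURL = "http://localhost:11434/v1"
        client := openai.NewClientWithConfig(cfg)

        resp, err := client.CreateChatCompletion(context.Background(), openai.ChatCompletionRequest{
            Model: os.Getenv("OLLAMA_MODEL"),
            Messages: []openai.ChatCompletionMessage{
                {Role: openai.ChatMessageRoleUser, Content: "Olá!"},
            },
        })
        if err != nil {
            panic(err)
        }
        fmt.Println(resp.Choices[0].Message.Content)
    }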

    Solutions

    Ways that do not affect the quality of the AI's response

    1. Get a better GPU
    2. Run Ollama on a machine with a better GPU
    3. Don't run the model on your own machine if you don't have a GPU. Use an LLM API instead. There are free options like OpenRouter, Gemini, Groq, or SambaNova. Gemini Flash is free, fast, and crazy good at long context.

    Ways that lower the quality of the AI's response

    Another way is to not throw so many tokens at your LLM. Techniques like RAG dynamically search for and load only the relevant tokens into the context based on what the user asks (and other things; RAG is a very broad term that covers a bunch of different approaches).

    There are many ways to do RAG. Look it up and find the best option for yourself. I'm not sure if you have the time, though, but it's a good way to learn.
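
    If you do want to try RAG, here is a very rough sketch of the retrieval step in Go, meant to be dropped into your own package. It assumes Ollama's /api/embeddings endpoint and an embedding model like nomic-embed-text pulled locally; the chunking, model name, and error handling are illustrative, not a finished pipeline:

    import (
        "bytes"
        "encoding/json"
        "math"
        "net/http"
        "sort"
    )

    type embedRequest struct {
        Model  string `json:"model"`
        Prompt string `json:"prompt"`
    }

    type embedResponse struct {
        Embedding []float64 `json:"embedding"`
    }

    // embed asks Ollama for the embedding vector of one piece of text.
    func embed(text string) ([]float64, error) {
        body, _ := json.Marshal(embedRequest{Model: "nomic-embed-text", Prompt: text})
        resp, err := http.Post("http://localhost:11434/api/embeddings", "application/json", bytes.NewReader(body))
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        var out embedResponse
        if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
            return nil, err
        }
        return out.Embedding, nil
    }

    // cosine similarity between two vectors of equal length.
    func cosine(a, b []float64) float64 {
        var dot, na, nb float64
        for i := range a {
            dot += a[i] * b[i]
            na += a[i] * a[i]
            nb += b[i] * b[i]
        }
        return dot / (math.Sqrt(na)*math.Sqrt(nb) + 1e-9)
    }

    // topChunks embeds the question and every wiki chunk, then returns the k most
    // similar chunks to put into the prompt instead of the whole wiki.
    func topChunks(question string, chunks []string, k int) ([]string, error) {
        q, err := embed(question)
        if err != nil {
            return nil, err
        }
        type scored struct {
            text  string
            score float64
        }
        ranked := make([]scored, 0, len(chunks))
        for _, c := range chunks {
            v, err := embed(c)
            if err != nil {
                return nil, err
            }
            ranked = append(ranked, scored{c, cosine(q, v)})
        }
        sort.Slice(ranked, func(i, j int) bool { return ranked[i].score > ranked[j].score })
        if k > len(ranked) {
            k = len(ranked)
        }
        top := make([]string, 0, k)
        for _, s := range ranked[:k] {
            top = append(top, s.text)
        }
        return top, nil
    }

    You would split the wiki pages into chunks once, embed and cache them, and then only call topChunks on each user question, stuffing the winning chunks into the prompt instead of the whole wiki.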

    Grab yourself a free Gemini key from AI Studio and throw everything into the context. This is the easiest way to do it, and it will probably perform better than RAG unless you actually spend some time optimizing RAG.

    You can also use fancy stuff like MemGPT (Letta), though it can make your project more complicated than it needs to be. It might be a bit slow, but you can always play a loading animation to make it feel less tedious.

    Oh, also, if you want the LLM to respond only in Brazilian Portuguese, you might as well write your system prompt in Brazilian Portuguese.
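
    With the DTOs from your question, that would look roughly like this (adjust to whatever your ollama_dto structs actually are):

    message, err := ollama_service.SendRequest(ollama_dto.Request{
        Model: os.Getenv("OLLAMA_MODEL"),
        Messages: []ollama_dto.Message{
            // Instructions live in a Portuguese system message...
            {Role: "system", Content: "Responda exclusivamente em português brasileiro. Extraia e resuma a matriz curricular, as atividades externas disponíveis e as informações sobre a coordenação do curso."},
            // ...and the (cleaned-up) page content goes in the user message.
            {Role: "user", Content: content},
        },
    })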