I am using RAG with Llama 3 for an AI bot, and I find that retrieval with ChromaDB is much slower than the LLM call itself. In the test below, with just one simple web page of about 1000 words, retrieval alone takes more than 2 seconds:
Time used for retrieving: 2.245511054992676
Time used for LLM: 2.1182022094726562
Here is my simple code:
import time

from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# "splits" holds the pre-split chunks of the web page (about 1000 words).
embeddings = OllamaEmbeddings(model="llama3")
vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings)
retriever = vectorstore.as_retriever()

question = "What is COCONut?"

# Time the retrieval step.
start = time.time()
retrieved_docs = retriever.invoke(question)
formatted_context = combine_docs(retrieved_docs)
end = time.time()
print(f"Time used for retrieving: {end - start}")

# Time the generation step.
start = time.time()
answer = ollama_llm(question, formatted_context)
end = time.time()
print(f"Time used for LLM: {end - start}")
I also found that when my ChromaDB grows to only about 1.4M, retrieval takes more than 20 seconds, while the LLM call still only takes about 3 or 4 seconds. Is there anything I am missing, or is RAG itself just this slow? To help narrow down where the time goes, the retrieval step can also be split into the query-embedding call and the actual vector search, using the same embeddings and vectorstore objects as above (embed_query and similarity_search_by_vector are standard LangChain calls):
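# Break the retrieval time into its two parts: embedding the question
# with llama3, and the nearest-neighbour search inside Chroma.
start = time.time()
query_vector = embeddings.embed_query(question)
print(f"Time used for query embedding: {time.time() - start}")

start = time.time()
docs = vectorstore.similarity_search_by_vector(query_vector, k=4)
print(f"Time used for vector search: {time.time() - start}")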