I'm trying to build a chatbot that runs locally with Ollama, and I'm stuck at the embedding step (ChromaDB).
When I provide the full PDF, the kernel crashes during the embedding process.
It works fine when I provide a single chapter from the book.
Problem block:

# Create the Chroma vector store
from langchain_chroma.vectorstores import Chroma

try:
    vector_db = Chroma.from_documents(
        documents=chunked_document,
        embedding=embedding_model,
        collection_name="local-rag",
        persist_directory="./db/db_nomic"
    )
    print("Embedded documents stored in ChromaDB successfully!")
except Exception as e:
    print(f"An error occurred: {e}")
Note:

embedding_model = OllamaEmbeddings(model="nomic-embed-text")
chunked_document = [Document(metadata={'source': 'xxx', 'page': 1, 'math_expressions': 'xxx'}, page_content=''), ...]
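Note that the example chunk above has page_content='' — before embedding, it can be worth confirming that no chunk is actually empty, since blank strings add nothing to the index and can trip up some embedding backends. A minimal, hypothetical filter (it only assumes the chunks expose a page_content attribute, as LangChain Document objects do):

```python
def drop_empty_chunks(chunks):
    """Return only the chunks whose page_content is non-blank."""
    return [c for c in chunks if c.page_content and c.page_content.strip()]
```

It can be applied as chunked_document = drop_empty_chunks(chunked_document) before calling Chroma.from_documents.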
Additional info:
Python version = 3.12.7
What I've tried so far:

I moved the code to a Python file instead of running it in a Jupyter notebook; execution stops at the same block.

I tried embedding in batches; the same issue occurs:
from langchain_chroma.vectorstores import Chroma

vector_db = Chroma(
    collection_name="local-rag",
    persist_directory="./dtbs/db_nomic",
    embedding_function=embedding_model
)

texts = [chunk.page_content for chunk in chunked_document]
metadatas = [chunk.metadata for chunk in chunked_document]

batch_size = 100
for i in range(0, len(texts), batch_size):
    batch_texts = texts[i:i+batch_size]
    batch_metadatas = metadatas[i:i+batch_size]
    vector_db.add_texts(texts=batch_texts, metadatas=batch_metadatas)
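The batching pattern in that loop can be isolated into a small standalone helper, which runs without Ollama or Chroma and makes the slicing easy to verify on its own (the function name batched is my own, not from the snippet above):

```python
def batched(items, batch_size=100):
    """Yield successive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```

Each yielded slice would then be passed to a single add_texts call, so no individual request to the embedding backend grows too large.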
I also tried embedding each chapter into its own database and then merging them into a single one:

chapter_paths = [
    "./partial_databases/db_nomic/ch1",
    "./partial_databases/db_nomic/ch2",
    ...,
]

vector_db = Chroma(
    collection_name="local-rag",
    persist_directory="./db/db_nomic",
    embedding_function=embedding_model
)

# Merge documents from each chapter database into the main database
for path in chapter_paths:
    chapter_db = Chroma(
        collection_name="local-rag",
        persist_directory=path,
        embedding_function=embedding_model
    )
    # Retrieve all documents from the current chapter database
    chapter_data = chapter_db.get()
    # Extract the documents and metadatas
    docs = chapter_data['documents']
    metadatas = chapter_data['metadatas']
    vector_db.add_texts(texts=docs, metadatas=metadatas)

print("Documents successfully merged into main database.")
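The merge step above can be sketched without a running Chroma instance. This toy version only assumes each chapter payload has the same shape as the dict returned by Chroma's get() (parallel "documents" and "metadatas" lists); the "main database" is just a plain list:

```python
def merge_chapters(payloads):
    """Collect (text, metadata) pairs from every chapter payload."""
    merged = []
    for data in payloads:
        # Pair each document with its metadata, preserving order
        merged.extend(zip(data["documents"], data["metadatas"]))
    return merged
```

In the real loop, each pair would instead be handed to vector_db.add_texts, which re-embeds the text in the main database.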
I'm expecting to create a vector database with ChromaDB that stores the whole PDF (246 pages).
These steps solved my issue:

Created a fresh virtual environment and reinstalled the dependencies.

Moved the code from a Jupyter notebook to a Python file.

Since the problem was solved by a fresh installation of the dependencies, it was most probably caused by an internal dependency conflict.