Tags: database, vector, langchain, chromadb, retrieval-augmented-generation

RAG using Langchain / Chroma - Unable to save more than 99 Records to Database


I'm using the following code to load the content of markdown files (only one file, in my case), split it into chunks, and then embed and store the chunks. My file is split into 801 chunks. However, the code fails to save the embeddings to disk in the vector DB.

```python
# Imports assume a classic (pre-0.1) LangChain layout; module paths
# may differ in newer versions.
import os
import shutil

from langchain.document_loaders import DirectoryLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

DATA_PATH = "data"      # adjust to your markdown folder
CHROMA_PATH = "chroma"  # adjust to your output folder


def load_documents():
    loader = DirectoryLoader(DATA_PATH, glob="*.md")
    documents = loader.load()
    return documents


def split_text(documents: list[Document]):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=300,
        chunk_overlap=100,
        length_function=len,
        add_start_index=True,
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks.")

    # Print one chunk as a sanity check.
    document = chunks[10]
    print(document.page_content)
    print(document.metadata)

    return chunks


def save_to_chroma(chunks: list[Document]):
    # Clear out the database first.
    if os.path.exists(CHROMA_PATH):
        shutil.rmtree(CHROMA_PATH)

    # Create a new DB from the documents.
    db = Chroma.from_documents(
        chunks, OpenAIEmbeddings(), persist_directory=CHROMA_PATH
    )
```
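If inserting all 801 chunks in a single `from_documents` call is the trigger, a common workaround is to add the documents to an existing store in smaller batches via `add_documents`. The batching helper below is a hypothetical sketch (`batched` is my own name, not a LangChain function), with the Chroma usage shown as commented-out code since it needs a valid OpenAI key:

```python
def batched(items, batch_size):
    """Yield successive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical usage with an existing Chroma store:
#
# db = Chroma(persist_directory=CHROMA_PATH,
#             embedding_function=OpenAIEmbeddings())
# for batch in batched(chunks, 50):
#     db.add_documents(batch)
```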

While analysing this problem, I also attempted to save the chunks one at a time, using a for loop:

```python
for i, chunk in enumerate(chunks):
    db = Chroma.from_documents(
        [chunk], OpenAIEmbeddings(), persist_directory=CHROMA_PATH
    )
```

I found that the code saves up to 99 chunks/embeddings but always crashes when it tries to save more. To investigate further, I opened the underlying database in DB Browser for SQLite and saw that Chroma was storing a maximum of 99 records in the 'embeddings' table. However, I was able to add more records manually, i.e. beyond 99.
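The record count can also be checked without DB Browser by querying the SQLite file directly. This is a sketch assuming Chroma's default on-disk layout (a `chroma.sqlite3` file inside the persist directory, with an `embeddings` table):

```python
import sqlite3


def count_rows(db_path: str, table: str = "embeddings") -> int:
    """Return the number of rows in the given table of a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    finally:
        conn.close()
    return count


# Hypothetical usage against Chroma's on-disk store:
# print(count_rows(os.path.join(CHROMA_PATH, "chroma.sqlite3")))
```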

I also tried a few variations, but none of them made any difference.

Does anybody know why this is happening and how to solve this problem?


Solution

  • My problem was solved when I re-installed Python on my PC because of an unrelated issue. Now the code works like magic: all the chunks are saved, regardless of whether I write them in one go or in batches.

    EDIT: The problem was actually caused by a conflict between the installed libraries. It is resolved by using a virtual environment, so there is no need to re-install Python.
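    For reference, isolating the project in a virtual environment looks roughly like this (the exact package list is an assumption; pin whatever versions work together for your project):

    ```shell
    # Create and activate an isolated environment for the project.
    python3 -m venv .venv
    . .venv/bin/activate

    # Install the stack fresh inside the venv, away from any
    # conflicting system-wide packages.
    pip install --upgrade pip
    pip install langchain chromadb openai
    ```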