Tags: python, langchain, chromadb, content-based-retrieval

Persist ParentDocumentRetriever of langchain


I am using langchain's ParentDocumentRetriever. Mostly following the code from their documentation, I created an instance using BGE-large embeddings, the NLTK text splitter, and Chroma:

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import NLTKTextSplitter
from langchain.vectorstores import Chroma

embedding_function = HuggingFaceEmbeddings(model_name='BAAI/bge-large-en-v1.5', cache_folder=hf_embed_path)
# This text splitter is used to create the child documents
child_splitter = NLTKTextSplitter(chunk_size=400)
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents",
    embedding_function=embedding_function,
    persist_directory="./chroma_db_child"
)

# The storage layer for the parent documents
store = InMemoryStore()
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore, 
    docstore=store, 
    child_splitter=child_splitter,
)

retriever.add_documents(docs, ids=None)

I added documents so that I can query using the small chunks for matching but get the full documents back: matching_docs = retriever.get_relevant_documents(query_text). The Chroma collection 'full_documents' was stored in ./chroma_db_child. I can reopen the collection and query it directly, and I get back the chunks, which is what is expected:

vector_db = Chroma(
    collection_name="full_documents",
    embedding_function=embedding_function,
    persist_directory="./chroma_db_child"
)

matching_doc = vector_db.max_marginal_relevance_search('whatever', k=3)
len(matching_doc)
>>> 3

One thing I can't figure out is how to persist the whole structure. The code above uses store = InMemoryStore(), which means that once execution stops, the parent documents are gone.

Is there a way, perhaps using something else instead of InMemoryStore(), to create a ParentDocumentRetriever that persists both the full documents and the chunks, so that I can restore them later without having to repeat the retriever.add_documents(docs, ids=None) step?
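To make the requirement concrete: what I need is a docstore backed by durable storage rather than a Python dict. A plain-Python stand-in illustrating the difference (FileKVStore is a hypothetical sketch, not a langchain class):

```python
import json
from pathlib import Path

class InMemoryKV:
    """Lives only as long as the process: contents vanish on restart."""
    def __init__(self):
        self._data = {}
    def mset(self, pairs):
        self._data.update(pairs)
    def mget(self, keys):
        return [self._data.get(k) for k in keys]

class FileKVStore:
    """Hypothetical file-backed store: each key becomes a JSON file on disk,
    so a new process can read back what an earlier one wrote."""
    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
    def mset(self, pairs):
        for key, value in pairs:
            (self.root / key).write_text(json.dumps(value))
    def mget(self, keys):
        return [json.loads((self.root / k).read_text())
                if (self.root / k).exists() else None
                for k in keys]

# Simulate a restart: a fresh FileKVStore pointed at the same directory
# still sees the documents, while a fresh InMemoryKV starts empty.
store = FileKVStore("./kv_demo")
store.mset([("doc-1", "full parent document text")])
restored = FileKVStore("./kv_demo")
print(restored.mget(["doc-1"])[0])  # prints: full parent document text
```

That restart-surviving behaviour is what I want from the docstore side of ParentDocumentRetriever.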


Solution

  • I had the same problem and found the solution here: https://github.com/langchain-ai/langchain/issues/9345

    You need to use the create_kv_docstore() function like this:

    from langchain.retrievers import ParentDocumentRetriever
    from langchain.storage import LocalFileStore
    from langchain.storage._lc_store import create_kv_docstore
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.vectorstores import Chroma

    fs = LocalFileStore("./store_location")
    store = create_kv_docstore(fs)
    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
    
    vectorstore = Chroma(collection_name="split_parents", embedding_function=embeddings, persist_directory="./db")
    retriever = ParentDocumentRetriever(
        vectorstore=vectorstore,
        docstore=store,
        child_splitter=child_splitter,
        parent_splitter=parent_splitter,
    )
    retriever.add_documents(documents, ids=None)
    

    You will end up with two folders: the Chroma db "db" with the child chunks and the "store_location" folder with the parent documents.

    I think it is also possible to save the documents in a Redis db or in Azure Blob Storage (https://python.langchain.com/docs/integrations/document_loaders/azure_blob_storage_container), but I am not sure.