Tags: python, progress-bar, langchain, faiss

How can I add a progress bar/status when creating a vector store with langchain?


Creating a vector store with the Python library langchain may take a while. How can I add a progress bar?


Example code that creates a vector store with LangChain:

import pprint
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.docstore.document import Document

model = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
embeddings = HuggingFaceEmbeddings(model_name = model)

def main():
    doc1 = Document(page_content="The sky is blue.",    metadata={"document_id": "10"})
    doc2 = Document(page_content="The forest is green", metadata={"document_id": "62"})
    docs = []
    docs.append(doc1)
    docs.append(doc2)

    for doc in docs:
        doc.metadata['summary'] = 'hello'

    pprint.pprint(docs)
    db = FAISS.from_documents(docs, embeddings)
    db.save_local("faiss_index")
    new_db = FAISS.load_local("faiss_index", embeddings)

    query = "Which color is the sky?"
    docs = new_db.similarity_search_with_score(query)
    print('Retrieved docs:', docs)
    print('Metadata of the most relevant document:', docs[0][0].metadata)

if __name__ == '__main__':
    main()

Tested with Python 3.11 with:

pip install langchain==0.1.1 langchain_openai==0.0.2.post1 sentence-transformers==2.2.2 langchain_community==0.0.13 faiss-cpu==1.7.4

The vector store is created with db = FAISS.from_documents(docs, embeddings).


Solution

  • LangChain does not natively support a progress bar for this as of the current release.

    I had a similar case, so instead of sending all the documents at once, I sent each document independently for ingestion and tracked progress on my end. This was helpful for me.

    You can do the ingestion in the following way:

        db = None  # built incrementally below
        with tqdm(total=len(docs), desc="Ingesting documents") as pbar:
            for d in docs:
                if db:
                    db.add_documents([d])
                else:
                    db = FAISS.from_documents([d], embeddings)
                pbar.update(1)
    
    

    From what I checked in the LangChain code https://github.com/langchain-ai/langchain/blob/master/libs/core/langchain_core/retrievers.py#L31, it makes a call to add_texts as well, so no major operation is performed here other than parsing.

    I had simple documents and didn't observe much of a difference. Others who have tried this on huge documents can add whether it introduces latency in their use case.
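    For a large corpus, adding one document per call may add overhead; a middle ground is to ingest in fixed-size batches and advance the bar by the batch size. A minimal sketch, assuming the same `docs`, `embeddings`, and `tqdm` setup as above (the `chunked` helper is not part of LangChain, just plain Python):

```python
def chunked(items, size):
    """Yield successive batches of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Usage with the FAISS loop from above (same docs/embeddings/tqdm):
# db = None
# with tqdm(total=len(docs), desc="Ingesting documents") as pbar:
#     for batch in chunked(docs, 64):
#         if db:
#             db.add_documents(batch)
#         else:
#             db = FAISS.from_documents(batch, embeddings)
#         pbar.update(len(batch))
```

    The batch size trades progress granularity against per-call overhead; 64 is only a placeholder.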

    Below is your updated code:

    import pprint
    from tqdm import tqdm
    from langchain_community.vectorstores import FAISS
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain.docstore.document import Document
    
    model = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
    embeddings = HuggingFaceEmbeddings(model_name = model)
    
    def main():
        doc1 = Document(page_content="The sky is blue.",    metadata={"document_id": "10"})
        doc2 = Document(page_content="The forest is green", metadata={"document_id": "62"})
        docs = []
        docs.append(doc1)
        docs.append(doc2)
    
        for doc in docs:
            doc.metadata['summary'] = 'hello'
    
        db = None
        with tqdm(total=len(docs), desc="Ingesting documents") as pbar:
            for d in docs:
                if db:
                    db.add_documents([d])
                else:
                    db = FAISS.from_documents([d], embeddings)
                pbar.update(1)  
    
        # pprint.pprint(docs)
        # db = FAISS.from_documents(docs, embeddings)
        db.save_local("faiss_index")
        new_db = FAISS.load_local("faiss_index", embeddings)
    
        query = "Which color is the sky?"
        docs = new_db.similarity_search_with_score(query)
        print('Retrieved docs:', docs)
        print('Metadata of the most relevant document:', docs[0][0].metadata)
    
    if __name__ == '__main__':
        main()
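    An alternative worth checking against your installed versions: `HuggingFaceEmbeddings` accepts an `encode_kwargs` dict that is forwarded to `SentenceTransformer.encode`, and sentence-transformers supports a `show_progress_bar` flag there. That surfaces a progress bar during the embedding step itself, so `FAISS.from_documents(docs, embeddings)` can stay as a single call:

```python
from langchain_community.embeddings import HuggingFaceEmbeddings

# show_progress_bar is forwarded to SentenceTransformer.encode,
# which displays a tqdm bar while the texts are embedded.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/multi-qa-MiniLM-L6-cos-v1",
    encode_kwargs={"show_progress_bar": True},
)
```

    Note this only covers the embedding phase, not the FAISS index construction, but embedding is typically where nearly all the time goes.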