python · mongodb · langchain · py-langchain · vector-search

I get an empty array from vector search in MongoDB with LangChain


I have the code:

loader = PyPDFLoader("https://arxiv.org/pdf/2303.08774.pdf")
data = loader.load()
docs = text_splitter1.split_documents(data)
vector_search_index = "vector_index"

vector_search = MongoDBAtlasVectorSearch.from_documents(
  documents=docs,
  embedding=OpenAIEmbeddings(disallowed_special=()),
  collection=atlas_collection,
  index_name=vector_search_index,
)

query = "What were the compute requirements for training GPT 4"
results = vector_search.similarity_search(query)
print("result: ", results)

Every time, results is an empty array. I don't understand what I'm doing wrong. This is the link to the LangChain documentation with examples. The information is saved to the database normally, but I cannot search it in this collection.


Solution

  • So I was able to get this to work in MongoDB with the following code:

    import os

    # Import paths for recent langchain packages; older versions expose
    # these under langchain.* / langchain_community.* modules instead.
    from langchain_community.document_loaders import PyPDFLoader
    from langchain_mongodb import MongoDBAtlasVectorSearch
    from langchain_openai import OpenAIEmbeddings
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    from pymongo import MongoClient

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
    
    loader = PyPDFLoader("https://arxiv.org/pdf/2303.08774.pdf")
    data = loader.load()
    docs = text_splitter.split_documents(data)
    
    DB_NAME = "langchain_db"
    COLLECTION_NAME = "atlas_collection"
    ATLAS_VECTOR_SEARCH_INDEX_NAME = "vector_index"
    MONGODB_ATLAS_CLUSTER_URI = os.environ.get("MONGO_DB_ENDPOINT")
    
    client = MongoClient(MONGODB_ATLAS_CLUSTER_URI)
    MONGODB_COLLECTION = client[DB_NAME][COLLECTION_NAME]
    
    vector_search = MongoDBAtlasVectorSearch.from_documents(
        documents=docs,
        embedding=OpenAIEmbeddings(disallowed_special=()),
        collection=MONGODB_COLLECTION,
        index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
    )
    
    query = "What were the compute requirements for training GPT 4"
    results = vector_search.similarity_search(query)
    print("result: ", results)
    

    At this point, I got the same result that you did: an empty array. Before it would return results, I had to create the vector search index in Atlas, and I made sure it was named the same as what is specified in ATLAS_VECTOR_SEARCH_INDEX_NAME:

    [screenshot: the vector search index definition in the Atlas UI]
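For reference, here is a sketch of what that Atlas Vector Search index definition might look like. The specifics are assumptions, not taken from the screenshot: it assumes LangChain's default embedding field name (`embedding`) and OpenAI's default `text-embedding-ada-002` model, which produces 1536-dimension vectors; adjust `numDimensions` for a different model.

```python
# Sketch of an Atlas Vector Search index definition (index type "vectorSearch").
# Assumptions: LangChain's default field name "embedding" and a 1536-dimension
# embedding model (OpenAI text-embedding-ada-002).
index_definition = {
    "fields": [
        {
            "type": "vector",
            "path": "embedding",    # field where MongoDBAtlasVectorSearch stores vectors
            "numDimensions": 1536,  # must match the embedding model's output size
            "similarity": "cosine",
        }
    ]
}
```

You can paste this definition into the Atlas UI when creating a "vectorSearch"-type index on the collection; recent pymongo versions can also create it programmatically via `create_search_index`.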

    FWIW - it was easier for me to do this in Astra DB (I tried it first, because I am a DataStax employee):

    import os

    # Import paths for recent langchain packages; older versions may differ.
    from langchain_astradb import AstraDBVectorStore
    from langchain_community.document_loaders import PyPDFLoader
    from langchain_openai import OpenAIEmbeddings
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
    
    loader = PyPDFLoader("https://arxiv.org/pdf/2303.08774.pdf")
    data = loader.load()
    docs = text_splitter.split_documents(data)
    atlas_collection = "atlas_collection"
    
    ASTRA_DB_API_ENDPOINT = os.environ.get("ASTRA_DB_API_ENDPOINT")
    ASTRA_DB_APPLICATION_TOKEN = os.environ.get("ASTRA_DB_APPLICATION_TOKEN")
    
    vector_search = AstraDBVectorStore.from_documents(
      documents=docs,
      embedding=OpenAIEmbeddings(disallowed_special=()),
      collection_name=atlas_collection,
      api_endpoint=ASTRA_DB_API_ENDPOINT,
      token=ASTRA_DB_APPLICATION_TOKEN,
    )
    
    query = "What were the compute requirements for training GPT 4"
    results = vector_search.similarity_search(query)
    print("result: ", results)
    

    Worth noting that Astra DB will create your vector index automatically, based on the dimensions of the embedding model.
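To make the mechanics concrete: in either store, `similarity_search` embeds the query into a vector and ranks the stored vectors by similarity. The following is a hypothetical, simplified sketch of that ranking (toy 3-dimension vectors standing in for real 1536-dimension embeddings; the document names are made up), not the actual library code:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": in reality these would be 1536-dimension OpenAI vectors.
stored = {
    "doc_about_gpt4_compute": [0.9, 0.1, 0.0],
    "doc_about_cooking":      [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # stand-in for the embedded query

# Rank stored documents by similarity to the query vector.
ranked = sorted(stored, key=lambda k: cosine_similarity(query_vector, stored[k]),
                reverse=True)
print(ranked[0])  # the GPT-4 compute document ranks first
```

This also shows why the query and document embeddings must come from the same model: the vectors are only comparable when they live in the same space with the same dimensions.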