Tags: machine-learning, search, langchain, large-language-model, vector-database

Search for documents with similar texts


I have a document with three attributes: tags, location, and text.

Currently, I am indexing all of them using LangChain/pgvector/embeddings.

The results are satisfactory, but I would like to know whether there is a better way. I want to find one or more documents with a specific tag and location, where the text can vary drastically while still meaning the same thing. That is why I thought of using embeddings and a vector database.

Would it also be a case of using RAG (Retrieval-Augmented Generation) to "teach" the LLM about some common abbreviations that it doesn't know?

import pandas as pd

from langchain_core.documents import Document
from langchain_postgres import PGVector
from langchain_openai.embeddings import OpenAIEmbeddings

connection = "postgresql+psycopg://langchain:langchain@localhost:5432/langchain"
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
collection_name = "notas_v0"

vectorstore = PGVector(
    embeddings=embeddings,
    collection_name=collection_name,
    connection=connection,
    use_jsonb=True,
)


### BEGIN INDEX

# df = pd.read_csv("notes.csv")
# df = df.dropna()  # .head(10000)
# df["tags"] = df["tags"].apply(
#     lambda x: [tag.strip() for tag in x.split(",") if tag.strip()]
# )


# long_texts = df["Texto Longo"].tolist()
# wc = df["Centro Trabalho Responsável"].tolist()
# notes = df["Nota"].tolist()
# tags = df["tags"].tolist()

# documents = list(
#     map(
#         lambda x: Document(
#             page_content=x[0], metadata={"wc": x[1], "note": x[2], "tags": x[3]}
#         ),
#         zip(long_texts, wc, notes, tags),
#     )
# )

# print(
#     [
#         vectorstore.add_documents(documents=documents[i : i + 100])
#         for i in range(0, len(documents), 100)
#     ]
# )
# print("Done.")

### END INDEX

### BEGIN QUERY

result = vectorstore.similarity_search_with_relevance_scores(
    "EVTD202301222707",
    filter={"note": {"$in": ["15310116"]}, "tags": {"$in": ["abcd", "xyz"]}},
    k=10,  # maximum number of results
)

### END QUERY

Solution

  • There is one primary unknown here: the approximate (average) number of tokens in the "text" part of your input.

    Scenario 1: You do not have a very long input (say, somewhere around 512 tokens).

    In this case, to get better results, you can train your own embedding model. Please look at my answer here, which has some information about it.

    Once you have the right embedding model, you index the corresponding text vectors in your RAG pipeline. There are a couple of other steps that apply to all scenarios, so I will add them at the end.

    Scenario 2: You have a very long input per document, that is, every "text" input is huge (say, ~8000 tokens, though this number can be anything). In this case you can leverage symbolic search instead of vector search. The reason is that, in any language, two texts that mean the same thing or share a similar context will almost certainly have a lot of word overlap between source and target. It is very rare to find two 10-page texts on the same topic that do not share a lot of words.

    So, you can leverage symbolic search here, ensemble it with vector-based validators, and use an LLM service that allows long-context prompts. You find some good candidates via symbolic search, then pass them on to the long-context LLM for the remaining parts.
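The candidate-then-validate idea in Scenario 2 can be sketched in plain Python. This is only an illustration under stated assumptions: BM25 stands in for "symbolic search", a cosine-similarity threshold stands in for the "vector-based validator", and the names `bm25_scores`, `hybrid_search`, `top_n`, and `min_cosine` are hypothetical. The vectors are toy values; in practice they would come from your embedding model, and the surviving candidates would then be passed to the long-context LLM.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a basic BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = (sum(len(t) for t in tokenized) / n_docs) or 1.0
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hybrid_search(query, query_vec, docs, doc_vecs, top_n=3, min_cosine=0.5):
    """Pick lexical candidates first, then keep those the vectors agree with."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    candidates = [i for i in ranked[:top_n] if scores[i] > 0]
    return [i for i in candidates if cosine(query_vec, doc_vecs[i]) >= min_cosine]


# Toy data: the 2-dimensional vectors are made up for illustration only.
docs = [
    "pump bearing replaced after vibration alarm",
    "replaced the bearing on the pump due to vibration",
    "painted the office walls",
]
doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

# Indices of documents that pass both the lexical and the vector check.
print(hybrid_search("pump bearing vibration", [1.0, 0.0], docs, doc_vecs))
```

Doing the lexical pass first keeps the expensive steps (embedding comparison, LLM call) limited to a handful of candidates, which matters when each document is thousands of tokens long.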

    Steps applicable to all scenarios:

    1. Your JSON object should also contain "tags" and "location" along with "text" and the vector:

      {
        "text": "some text",
        "text_embedding": [...],  # not applicable in symbolic search
        "location": "loc",
        "tags": []
      }
      
    2. This way, when you get matches from either vector search or symbolic search, you will be able to further filter or sort them based on other properties such as tags and location.
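As a minimal sketch of step 2, assuming each match is a plain dict shaped like the JSON object above plus a hypothetical "score" field set by whichever search produced it, post-filtering could look like this. The function name `filter_and_sort` and the sample values are illustrative, not part of any library.

```python
def filter_and_sort(matches, tag, location):
    """Keep matches with the given tag and location, best score first."""
    kept = [
        m
        for m in matches
        if m.get("location") == location and tag in m.get("tags", [])
    ]
    return sorted(kept, key=lambda m: m.get("score", 0.0), reverse=True)


# Sample matches, as they might come back from vector or symbolic search.
matches = [
    {"text": "a", "location": "plant-1", "tags": ["abcd"], "score": 0.71},
    {"text": "b", "location": "plant-2", "tags": ["abcd"], "score": 0.93},
    {"text": "c", "location": "plant-1", "tags": ["xyz"], "score": 0.88},
    {"text": "d", "location": "plant-1", "tags": ["abcd", "xyz"], "score": 0.90},
]

for m in filter_and_sort(matches, tag="abcd", location="plant-1"):
    print(m["text"], m["score"])
```

If the store supports metadata filters at query time (as the `filter=` argument in the question does), applying the tag and location filter there is usually preferable, since it avoids retrieving matches that will be discarded anyway.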

    Please comment if you have more doubts!