I have the following code, which loads my PDF file, generates embeddings, and stores them in a vector DB. I can then use it to perform searches.
The issue is that every time I run it, the embeddings are regenerated and stored in the DB alongside the ones already created.
I'm trying to figure out how to load an existing vector DB into LangChain rather than recreating it every time the app runs.
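# Import paths as in the LangChain releases current when this was written;
# newer releases move these classes to the langchain_community package.
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import GooglePalmEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import DocArrayHnswSearch


def load_embeddings(store, file):
    # delete the dir
    # shutil.rmtree(store)  # I have to delete it or it just loads double data
    loader = PyPDFLoader(file)
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
        is_separator_regex=False,
    )
    pages = loader.load_and_split(text_splitter)
    # from_documents embeds every chunk and writes it to the store on each call
    return DocArrayHnswSearch.from_documents(
        pages, GooglePalmEmbeddings(), work_dir=store + "/", n_dim=768
    )


db = load_embeddings("linda_store", "linda.pdf")
embeddings = GooglePalmEmbeddings()
query = "Have I worked with Oauth?"
embedding_vector = embeddings.embed_query(query)
docs = db.similarity_search_by_vector(embedding_vector)
for i in range(len(docs)):
    print(i, docs[i])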
This works fine, but if I run it again it just loads the file into the vector DB again. I want it to use the DB once it has been created, not create it again.
I can't seem to find a method for loading it. I tried:
db = DocArrayHnswSearch.load("hnswlib_store/", embeddings)
But that's a no-go.
Your load_embeddings function recreates the database every time you call it. Here's why:
...
    # We don't need this when loading from store
    loader = PyPDFLoader(file)
...
...
    # We don't need to pass pages when loading from store
    return DocArrayHnswSearch.from_documents(
        pages, GooglePalmEmbeddings(), work_dir=store + "/", n_dim=768
    )
...
def query_vector_store(query):
    embeddings = OpenAIEmbeddings(openai_api_key=open_ai_key)
    vector_store = DocArrayHnswSearch.from_params(embeddings, "store/", 1536)
    embedding_vector = embeddings.embed_query(query)
    return vector_store.similarity_search_by_vector(embedding_vector)
I am using OpenAIEmbeddings() here, but the same code should apply to GooglePalmEmbeddings(); just make sure you update the dimension value (1536 in this example, 768 in your code).
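Since the dimension has to match the length of the vectors your embedding model returns, one option is to measure it instead of hard-coding it. The sketch below is only an illustration: embedding_dim and FakeEmbeddings are hypothetical names, not part of LangChain, and the helper relies only on the embed_query method already used in this thread. Note that with a real embeddings object this costs one extra API call.

```python
from typing import List, Protocol


class SupportsEmbedQuery(Protocol):
    """Anything with an embed_query method, e.g. a LangChain embeddings object."""

    def embed_query(self, text: str) -> List[float]: ...


def embedding_dim(embeddings: SupportsEmbedQuery, probe: str = "dimension probe") -> int:
    """Embed a short probe string and return the length of the resulting vector."""
    return len(embeddings.embed_query(probe))


class FakeEmbeddings:
    """Stand-in for this illustration only; returns a fixed-size zero vector."""

    def __init__(self, size: int) -> None:
        self.size = size

    def embed_query(self, text: str) -> List[float]:
        return [0.0] * self.size


if __name__ == "__main__":
    # With a real model you would pass e.g. GooglePalmEmbeddings() instead.
    print(embedding_dim(FakeEmbeddings(768)))
```

The returned value can then be passed as the dimension argument when opening the store, so the same code keeps working if you switch embedding models.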
We're using DocArrayHnswSearch.from_params instead to load the embeddings from the store. This method does not expect the documents.
We then use vector_store to perform the similarity search. As you can see from the query_vector_store(query: str) function above, we're not re-loading the documents from the PDF loader every time. Instead, we're just passing in our embeddings, work directory, and dimensions.
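If you want a single entry point that builds the store on the first run and only opens it afterwards, a small guard around the two calls may be enough. This is a minimal sketch under one assumption: a non-empty work directory means the index was already built. open_or_build is a hypothetical helper, not a LangChain function, and the wiring in the trailing comment reuses the calls from this thread.

```python
import os
from typing import Callable, TypeVar

T = TypeVar("T")


def open_or_build(store_dir: str, load: Callable[[str], T], build: Callable[[str], T]) -> T:
    """Load the store if store_dir already holds files; otherwise build it once."""
    if os.path.isdir(store_dir) and os.listdir(store_dir):
        return load(store_dir)
    os.makedirs(store_dir, exist_ok=True)
    return build(store_dir)


# Hypothetical wiring with the calls from this thread (not executed here):
#   db = open_or_build(
#       "linda_store",
#       load=lambda d: DocArrayHnswSearch.from_params(GooglePalmEmbeddings(), d + "/", 768),
#       build=lambda d: load_embeddings(d, "linda.pdf"),
#   )
```

Passing the two steps as callables means the PDF is only read and embedded when the build branch actually runs.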
You can use the method like this: query_vector_store('YOUR_QUERY').
Based on your for loop here:
for i in range(len(docs)):
    print(i, docs[i])
You'll see the documents sorted by most similar.
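Printing whole Document objects can get noisy, so here is a rough sketch of a more compact listing. format_results is a hypothetical helper; it assumes each result has page_content and metadata attributes, and that the PDF loader stored a "page" key in the metadata (it falls back to "?" when that key is missing). The stand-in objects below are made up purely for the demonstration.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def format_results(docs: Iterable[Any], preview_chars: int = 80) -> List[str]:
    """Return one line per document: rank, page number if present, and a text preview."""
    lines = []
    for rank, doc in enumerate(docs, start=1):
        page = doc.metadata.get("page", "?")
        # Collapse newlines and repeated spaces so each preview fits on one line.
        preview = " ".join(doc.page_content.split())[:preview_chars]
        lines.append(f"{rank}. (page {page}) {preview}")
    return lines


if __name__ == "__main__":
    # Stand-in objects with the same two attributes as a LangChain Document.
    docs = [
        SimpleNamespace(page_content="Implemented OAuth 2.0\nlogin flows", metadata={"page": 3}),
        SimpleNamespace(page_content="Built REST APIs", metadata={}),
    ]
    print("\n".join(format_results(docs)))
```

In your script you would call it as print("\n".join(format_results(docs))) with the list returned by the similarity search.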
I hope this helps!