I'm trying to use Sentence Transformers and Haystack for document retrieval, focusing on searching documents on other metadata beside document text.
I'm using a dataset of academic publication titles, and I've appended a fake publication year (which I want to use as a search term). From reading around I've combined the columns and just added a separator between the title and publication year, and included the column titles since I thought maybe this could add context. An example input looks like:
title Sparsity-certifying Graph Decompositions [SEP] published year 1980
I have a document store and method of retrieving here, based on this:
document_store_faiss = FAISSDocumentStore(faiss_index_factory_str="Flat",
return_embedding=True,
similarity='cosine')
retriever_faiss = EmbeddingRetriever(document_store_faiss,
embedding_model='all-mpnet-base-v2',
model_format='sentence_transformers')
document_store_faiss.write_documents(df.rename(columns={'combined':'content'}).to_dict(orient='records'))
document_store_faiss.update_embeddings(retriever=retriever_faiss)
def get_results(query, retriever, n_docs = 25):
return [(item.content) for item in retriever.retrieve(q, top_k = n_docs)]
q = 'published year 1999'
print('Results: ')
res = get_results(q, retriever_faiss)
for r in res:
print(r)
I do a check to see if any inputs actually have a publication year matching the search term, but when I look at my search results I'm getting entries with seemingly random published years. I was hoping that at least the results would all be the same published year, since I hoped to do more complicated queries like "published year before 1980".
If anyone could either tell me what I'm doing wrong, or whether I have misunderstood this process / expected results it would be much appreciated.
It sounds like you need metadata filtering rather than placing the year within the query itself. The FaissDocumentStore
doesn't support filtering, I'd recommend switching to the PineconeDocumentStore
which Haystack introduced in the v1.3 release a few days ago. It supports the strongest filter functionality in the current set of document stores.
You will need to make sure you have the latest version of Haystack installed, and it needs an additional pinecone-client
library too:
pip install -U farm-haystack pinecone-client
There's a guide here that may help, it will go something like:
document_store = PineconeDocumentStore(
api_key="<API_KEY>", # from https://app.pinecone.io
environment="us-west1-gcp"
)
retriever = EmbeddingRetriever(
document_store,
embedding_model='all-mpnet-base-v2',
model_format='sentence_transformers'
)
Before you write the documents you need to convert the data to include your text in content
(as you have done above, but no need to pre-append the year), and then include the year as a field in a meta
dictionary. So you would create a list of dictionaries that look like:
dicts = [
{'content': 'your text here', 'meta': {'year': 1999}},
{'content': 'another record text', 'meta': {'year': 1971}},
...
]
I don't know the exact format of your df
but assuming it is something like:
text | year |
---|---|
"your text here" | 1999 |
"another record here" | 1971 |
We could write the following to reformat it:
df = df.rename(columns={'text': 'content'}) # you did this already
# create a new 'meta' column that contains {'year': <year>} data
df['meta'] = df['year'].apply(lambda x: {'year': x})
# we don't need the year column anymore so we drop it
df = df.drop(['year'], axis=1)
# now convert into the list of dictionaries format as you did before
dicts = df.to_dict(orient='records')
This data replaces the df dictionaries you write, so we would continue as so:
document_store.write_documents(dicts)
document_store.update_embeddings(retriever=retriever)
Now you can query with filters, for example to search for docs with the publish year of 1999 we use the condition "$eq"
(equals):
docs = retriever.retrieve(
"some query here",
top_k=25,
filters={
{"year": {"$eq": 1999}}
}
)
For published before 1980 we can use "$lt"
(less than):
docs = retriever.retrieve(
"some query here",
top_k=25,
filters={
{"year": {"$lt": 1980}}
}
)