What I want to achieve: I have thousands of documents (descriptions of incidents), and I would like to find the documents that match a phrase or are similar to the words in the phrase. For example, for the input phrase "electric vehicle", I would like to find all the documents that discuss anything happening with any type of electric vehicle or conveyance. The documents in the corpus might not contain the word "vehicle" but may mention a specific vehicle type instead, such as "scooter", "bicycle", or "hoverboard", and a document may contain the word "electrical" or even something like "lithium battery of a ". So, from an input phrase like "an electric vehicle", "an electric automobile", or "vehicle powered by a lithium-ion battery", I need to find all the documents that have mentions related to that term. However, I don't want to capture documents that mention "automobile" or "scooter" without any mention of "electric" or "lithium-ion". So, from a phrase of 1 to 4 words, I must find matching documents, each containing anywhere from 2 to 100 words across 1 to 7 sentences.
The list of input phrases (used to find matching documents) will vary, so I suppose something like Siamese networks or training a classification model is not an option. The number of documents will also keep growing daily, and each document is independent of the others.
Here's what I have done so far:
I have used sentence-transformers (I tried the pre-trained models multi-qa-mpnet-base-dot-v1, all-MiniLM-L12-v2, all-MiniLM-L16-v2, and all-mpnet-base-v2) to get normalized embeddings for all the documents and for my input phrase. I then computed the cosine similarity between the input phrase's embedding and every document embedding, and took the top 20 results with the highest values.
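The pipeline described above can be sketched roughly as follows. This is an illustrative reconstruction, not the exact code that was run: the helper names `embed`, `cosine`, and `top_k` are hypothetical, the model name is just one of those listed above, and the ranking is written in plain Python so it can be checked on toy vectors without downloading a model.

```python
import math


def embed(texts, model_name="all-mpnet-base-v2"):
    """Encode texts into unit-length vectors.

    Assumes the sentence-transformers package is installed; it is
    imported lazily so the ranking helpers below work without it.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(texts, normalize_embeddings=True).tolist()


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vec, doc_vecs, k=20):
    """Return (document index, score) pairs for the k most similar documents."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

With a real corpus this would be called as `top_k(embed([phrase])[0], embed(documents), k=20)`, where `phrase` and `documents` are placeholders for the input phrase and the list of incident descriptions.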
The matched documents were barely relevant. For example, for the input phrase "an electrical vehicle", the documents with the highest cosine similarity contain nothing but the word "electrical". These are followed by documents containing only "vehicle", then by documents containing only "electrical vehicle", a few more words, or the same two words in different forms, and then by slightly longer documents that mention only "vehicle" without "electrical" or vice versa. I presume this is because the input phrase contains so few words.
How do I counter this and find documents that actually mention all the words in my input phrase, instead of matching documents on just one of the words?
In general, your approach so far seems sensible, and you should be able to get more relevant search results. I suggest these improvements:
You could also provide a minimal reproducible example. This might help others give more detailed recommendations.