I have 30 news articles that I love and would like to find more like them. I created embeddings with DistilBERT and saved them to Faiss and Milvus in databases called ILoveTheseArticles (I'm trying both out). All of the feature vectors have the same dimensionality and were built from the same maximum number of characters. As new news comes in, I would like to vectorize each article, find the nearest article in ILoveTheseArticles, and get the distance. Based on that distance I would keep or discard the new article, effectively a binary classifier that I don't need to constantly retrain every time I add new similar articles.
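For context, this is roughly the pipeline I have in mind (a minimal sketch; the checkpoint name, mean pooling, 512-token truncation, and the 0.51 cutoff are just placeholder assumptions, not settled choices):

```python
import numpy as np
import faiss
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed setup: distilbert-base-uncased, mean pooling over the last hidden state,
# truncation to 512 tokens, cosine similarity via a normalized inner-product index.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

def embed(text: str) -> np.ndarray:
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0).numpy().astype("float32")

# Build the ILoveTheseArticles index from the liked articles.
loved_texts = ["first article I love...", "second article I love..."]  # placeholder texts
loved_vecs = np.stack([embed(t) for t in loved_texts])
faiss.normalize_L2(loved_vecs)                 # unit vectors: inner product == cosine
index = faiss.IndexFlatIP(loved_vecs.shape[1])
index.add(loved_vecs)

# Classify an incoming article by its nearest loved article.
KEEP_THRESHOLD = 0.51                          # placeholder cutoff
query = embed("text of the new incoming article")[None, :]
faiss.normalize_L2(query)
scores, ids = index.search(query, 1)           # top-1 neighbour
keep = scores[0, 0] >= KEEP_THRESHOLD
print(f"nearest loved article {ids[0, 0]} at cosine {scores[0, 0]:.2f} -> keep={keep}")
```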
As a cosine similarity example (Figure 1): if OA and OB exist in ILoveTheseArticles and I search with a new embedding OC, OB is closest to OC at 0.86. If the threshold for keeping is, say, 0.51, I would keep the OC article because it is similar to an article that I love.
As an L2 example (Figure 1): if A' and B' exist in ILoveTheseArticles and I search with C' using a threshold of, say, 10.5, I would reject C' because the closest article, B', is at a distance of 20.62.
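In code, the two decision rules I'm describing would look something like the snippet below (scores copied from the Figure 1 examples; note that the inequality flips between the two metrics because cosine similarity rewards higher values while L2 rewards lower ones):

```python
# Hypothetical nearest-neighbour scores, taken from the Figure 1 examples.

# Cosine similarity: higher means more similar, so keep when score >= threshold.
best_cosine = 0.86                      # OB is the closest loved article to OC
keep_cosine = best_cosine >= 0.51       # True  -> keep the OC article

# L2 distance: lower means more similar, so keep when distance <= threshold.
best_l2 = 20.62                         # B' is the closest loved article to C'
keep_l2 = best_l2 <= 10.5               # False -> reject the C' article

print(keep_cosine, keep_l2)             # True False
```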
Is it possible to infer similar news articles using this approach with embeddings and distance? I second-guess this approach when I read confusing answers to a similar-ish question. Is cosine similarity or inner product (IP) better than L2 in this scenario, or vice versa?
This is a great way of finding similar articles. As for the different distance calculations, there isn't much difference between them here: DistilBERT embeddings aren't based on word frequencies any more but on the weights the model assigns to the overall text, so whichever metric you choose should return similar rankings from the similarity search.
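In fact, if you L2-normalize the vectors before indexing, the rankings are not just similar but identical: for unit vectors, ||a - b||^2 = 2 - 2 * cos(a, b), so an inner-product (cosine) index and an L2 index order neighbours the same way and only the score scale differs. A quick sketch with random stand-in vectors (the dimensionality and k are arbitrary):

```python
import numpy as np
import faiss

# Random stand-ins for 30 article embeddings plus one query (768 = DistilBERT hidden size).
rng = np.random.default_rng(0)
vectors = rng.standard_normal((30, 768)).astype("float32")
query = rng.standard_normal((1, 768)).astype("float32")

# After L2-normalization, ||a - b||^2 = 2 - 2 * cos(a, b),
# so L2 distance and cosine/inner product rank neighbours identically.
faiss.normalize_L2(vectors)
faiss.normalize_L2(query)

ip_index = faiss.IndexFlatIP(768)   # inner product == cosine on unit vectors
l2_index = faiss.IndexFlatL2(768)
ip_index.add(vectors)
l2_index.add(vectors)

_, ip_ids = ip_index.search(query, 5)
_, l2_ids = l2_index.search(query, 5)
print(np.array_equal(ip_ids, l2_ids))  # True: same ordering, only the score scale differs
```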
The harder part will be figuring out where to set your cutoff, and I believe that will come down to manually checking results to see what a good limit would be.
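One rough way to do that manual check: hand-label a small sample of incoming articles as keep/discard, then sweep candidate thresholds and see which one agrees with your labels most often. The scores below are made up purely to show the idea:

```python
# Hypothetical hand-labelled sample: (top-1 cosine score, did I actually like it?).
labelled = [(0.91, True), (0.83, True), (0.62, True), (0.48, False), (0.37, False)]

# Sweep candidate cutoffs and count how often each one agrees with the hand labels.
for threshold in (0.4, 0.5, 0.6, 0.7, 0.8):
    agree = sum((score >= threshold) == liked for score, liked in labelled)
    print(f"threshold {threshold:.1f}: {agree}/{len(labelled)} decisions match")
```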