machine-learning unsupervised-learning

TF-IDF/Cosine Similarity - Similarity Histogram


I created a histogram with the similarity scores of all documents in a corpus. The scores were computed with TF-IDF/Cosine Similarity. See included image. I'm not 100% sure how to read the chart. Does the compactness of scores indicate that the corpus is closely related in a good way or closely related in a bad way? Or am I looking at this completely wrong?

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Unigrams and bigrams; drop terms that appear in fewer than 5 documents
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=5)
tfidf_matrix = tf.fit_transform(ds['clean_text'])

# TfidfVectorizer L2-normalises each row, so the linear kernel equals cosine similarity
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
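For reference, a self-contained sketch of the pipeline above, including the histogram step the question describes. The corpus, the `min_df=1` setting (lowered from 5 so the toy corpus isn't filtered away), and the output filename are all placeholders, not from the original post:

```python
# Hedged sketch: TF-IDF -> cosine similarity -> histogram on a toy corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Placeholder corpus standing in for ds['clean_text']
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock prices fell sharply today",
    "markets rallied after the announcement",
]

# min_df lowered to 1 here; the original used min_df=5 on a larger corpus
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1)
tfidf_matrix = tf.fit_transform(docs)

# Rows are L2-normalised, so the linear kernel equals cosine similarity
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

# Keep only the upper triangle, excluding the diagonal of self-similarities
# (which are all 1.0 and would inflate the rightmost bin)
iu = np.triu_indices_from(cosine_similarities, k=1)
scores = cosine_similarities[iu]

plt.hist(scores, bins=20, range=(0, 1))
plt.xlabel("cosine similarity")
plt.ylabel("pair count")
plt.savefig("similarity_hist.png")
```

Excluding the diagonal matters when reading the histogram: every document has similarity 1.0 with itself, and including those self-pairs would add a spurious spike at the right edge.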

Solution

  • Looking at the histogram, it would seem that the document similarity is not that concentrated (cosine similarity is bounded in [0, 1], and your histogram spans roughly 0.2–1). Whether this is good or bad depends on your expectations for the data and what you want to do with the TF-IDF matrix later on. If you have a diverse corpus (e.g. Wikipedia), you would expect a wide range of cosine similarity scores and should be suspicious of a narrow one. Conversely, if your corpus is drawn from highly similar documents (e.g. book reports from a class of students), a narrow, high-scoring range is exactly what you would expect.

    In general, the distribution of your similarity scores is more of an FYI than a measure of dataset quality.