
Carrot2: document similarity, and how are document indexes ordered in the tf-idf matrix?


I'm trying to determine the similarity between two documents using Carrot2. Is it possible to get this similarity directly from the framework?

Additionally, I've been studying the tf-idf matrix and realized that the rows correspond to all the stemmed words and the columns to documents. However, how can I identify which document corresponds to which column?

For example, given a list of documents, will the column order be the same as the order of the documents in the list?

Example:

List docs = {doc1, doc2, doc3}

and

Column 0 = doc1, Column 1 = doc2, ...

Is that right?


Solution

  • Carrot2 does not use the conventional notion of document-document similarity, so the framework does not provide it directly. You can, however, use the term-document matrix to compute all sorts of document-document similarity measures yourself.

    You are correct in assuming that the columns of the term-document matrix are in the same order as the documents in the input list. You can check the source code to clear any other doubts.
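As a rough illustration of computing a similarity from the term-document matrix, here is a minimal sketch of cosine similarity between two columns. It deliberately does not call any Carrot2 API: it assumes you have already copied the tf-idf matrix into a plain `double[][]` with rows as terms and columns as documents. The class name `ColumnSimilarity`, the method `cosine`, and the sample values are all hypothetical.

```java
public final class ColumnSimilarity {

    /**
     * Cosine similarity between columns a and b of a term-document matrix.
     * Assumption: rows are terms, columns are documents, in input-list order.
     */
    public static double cosine(double[][] tdMatrix, int a, int b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (double[] termRow : tdMatrix) {
            dot += termRow[a] * termRow[b];
            normA += termRow[a] * termRow[a];
            normB += termRow[b] * termRow[b];
        }
        // A document with no weighted terms has no direction; treat it as dissimilar.
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        // Made-up weights: 3 terms (rows) x 3 documents (columns).
        double[][] tdMatrix = {
            {1.0, 2.0, 0.0},
            {0.0, 0.0, 3.0},
            {2.0, 4.0, 0.0},
        };
        // Columns 0 and 1 point in the same direction; columns 0 and 2 share no terms.
        System.out.println("doc1 vs doc2: " + cosine(tdMatrix, 0, 1));
        System.out.println("doc1 vs doc3: " + cosine(tdMatrix, 0, 2));
    }
}
```

Because the columns follow the input order, column index `i` here would correspond to the `i`-th document in the list passed to the framework.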