scikit-learn, nlp, cosine-similarity, spacy, term-document-matrix

Adding a new document to the term-document matrix for similarity calculations


So I am aware there are several methods for finding the most similar document, or say the three most similar documents, in a corpus. I know there can be scaling issues; for now I have around ten thousand documents and have been running tests on a subset of around thirty. This is what I've got so far, but I'm considering looking into Elasticsearch or doc2vec if this proves to be impossible or inefficient.

The scripts work very nicely so far. They use spaCy to tokenise the text and scikit-learn's TfidfVectorizer to fit across all the documents, and very similar documents are found. I notice that the shape of the matrix coming out of the pipeline is (33, 104354), which implies a vocabulary of 104,354 terms (excluding stopwords) across the 33 documents. That step takes a good twenty minutes to run, but the next step, a matrix multiplication which computes all the cosine similarities, is very quick. However, I know it might slow down as that matrix grows to thousands rather than thirty rows.
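For reference, the fitting step described above can be sketched roughly like this (a minimal sketch: a trivial split-based tokeniser stands in for spaCy, and the three-document toy corpus is mine, not the real 33 documents):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy stand-in corpus; the real one has 33 documents
docs = [
    "roast the lamb with garlic",
    "train the model on text data",
    "peel the garlic cloves",
]

# stand-in for the spaCy tokeniser: any callable can be passed as `tokenizer`
def tokenize(text):
    return text.split()

vectorizer = TfidfVectorizer(tokenizer=tokenize, token_pattern=None)
X = vectorizer.fit_transform(docs)   # sparse term-document matrix
print(X.shape)                       # (n_documents, vocabulary_size)
```

With the real corpus this is the step that produces the (33, 104354) matrix; the toy version just shows the same call pattern.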

If you could efficiently add a new document to the matrix, it wouldn't matter if the initial computation took ten hours or even days, as long as you saved the result.

  1. When I press Tab after the ., there seems to be an attribute on the vectorizer called vectorizer.fixed_vocabulary_ . I can't find it on Google or in the scikit-learn docs. Anyway, when I access it, it returns False. Does anyone know what this is? I'm thinking it might be useful to fix the vocabulary if possible; otherwise it might be troublesome to add a new document to the term-document matrix, although I'm not sure how to do that.
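As far as I can tell, fixed_vocabulary_ is an attribute set when the vectorizer is fitted: it is True when you supplied a vocabulary yourself and False when the vectorizer learned one from the data. A small sketch with my own toy data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog ran"]

# no vocabulary supplied: the vectorizer learns one, so fixed_vocabulary_ is False
learned = TfidfVectorizer()
learned.fit(docs)
print(learned.fixed_vocabulary_)   # False

# vocabulary supplied up front: fixed_vocabulary_ is True
fixed = TfidfVectorizer(vocabulary=["cat", "dog", "sat", "ran", "the"])
fixed.fit(docs)
print(fixed.fixed_vocabulary_)     # True
```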

Someone asked a similar question here which got voted up but nobody ever answered.

He wrote:

For new documents, what do I do when I get a new document doc(k)? Well, I have to compute the similarity of this document with all the previous ones, which doesn't require building a whole matrix. I can just take the inner product doc(k) dot doc(j) for all previous j, and that results in S(k, j), which is great.

  1. Does anyone understand exactly what he means here, or have any good links where this rather obscure topic is explained? Is he right? I suspect that the ability to add new documents with this inner product, if he is right, will depend on fixing the vocabulary as mentioned above.
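The inner-product step he describes can be sketched like this (toy corpus mine; the key assumption is that the new document is vectorised against the same vocabulary, which transform() does, and because TfidfVectorizer L2-normalises rows by default, the inner product is exactly the cosine similarity):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "roast lamb with garlic",
    "bake bread with flour",
    "train a text model",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)          # existing term-document matrix

# new document doc(k), projected onto the SAME vocabulary
new_vec = vectorizer.transform(["garlic roast lamb"])

# one row of similarities S(k, j), without rebuilding the whole matrix;
# rows are L2-normalised, so the inner product is the cosine similarity
sims = (new_vec @ X.T).toarray().ravel()
print(sims.argmax())                          # 0: the lamb/garlic document
```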

Solution

  • OK, I solved it. It took many hours; the other post on this topic confused me with the way it described the linear algebra, and it failed to mention an aspect that was perhaps obvious to the person who wrote it.

    So thanks for the information about the vocabulary.

    So vectorizer was an instance of sklearn.feature_extraction.text.TfidfVectorizer. I used the vocabulary_ attribute to pull the vocabulary of the existing 33 texts out:

    v = vectorizer.vocabulary_
    print(type(v))
    >> <class 'dict'>
    print(len(v))
    >> 104354
    

    Pickled this dictionary for future use and, just to test that it worked, reran fit_transform on the pipeline object containing the TfidfVectorizer with the parameter vocabulary=v, which it did.
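The pickling round-trip looks something like this (a sketch with toy data; pickle.dumps/loads stand in for writing the dictionary to a file and reading it back):

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["peel the garlic", "roast the lamb"]
vectorizer = TfidfVectorizer()
vectorizer.fit(docs)
v = vectorizer.vocabulary_            # dict mapping term -> column index

# round-trip through pickle (in practice: pickle.dump to disk, load later)
v_loaded = pickle.loads(pickle.dumps(v))

# a fresh vectorizer locked to the saved columns
fixed = TfidfVectorizer(vocabulary=v_loaded)
X_new = fixed.fit_transform(["roast the garlic"])
print(X_new.shape[1] == len(v))       # True: same number of columns as before
```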

    The original pairwise similarity matrix was found by pairwise_similarity = (p * p.T).A, where p is the term-document matrix returned by the fitted pipeline.

    Added a small new document:

    new_document = """
    Remove the lamb from the fridge 1 hour before you want to cook it, to let it come up to room temperature. Preheat the oven to 200ºC/400ºF/gas 6 and place a roasting dish for the potatoes on the bottom. Break the garlic bulb up into cloves, then peel 3, leaving the rest whole.
    """
    

    Fitted the pipeline to just the one document, with its now fixed vocabulary:

    p_new = pipe.fit_transform([new_document]) 
    print (p_new.shape)
    >> (1, 104354)
    
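One caveat worth noting (my observation, not from the original answer): fit_transform on a single document re-learns the IDF weights from that one document alone, so every IDF collapses to 1 and the new row is weighted differently from the original 33. If the fitted vectorizer itself is kept around, transform() reuses the original IDF weights and avoids that:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["peel the garlic", "roast the lamb", "slice the potatoes"]
vectorizer = TfidfVectorizer()
p = vectorizer.fit_transform(corpus)

# transform() maps the new text onto the existing vocabulary AND keeps the
# IDF weights learned from the original corpus, unlike a fresh fit_transform
p_new = vectorizer.transform(["roast the garlic"])
print(p_new.shape)                    # (1, <same vocabulary size as p>)
```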

    Then put them together like this:

    from scipy.sparse import vstack as vstack_sparse_matrices
    p_combined = vstack_sparse_matrices([p, p_new])
    print (p_combined.shape)
    >> (34, 104354)
    

    and reran the pairwise similarity equation:

    pairwise_similarity = (p_combined * p_combined.T).A
    

    I was not totally confident in the code or theory, but I believe this is correct and has worked; the proof of the pudding is in the eating, and my later code found that the documents most similar to the new one were also cooking-related. I changed the new document to several other topics and reran it all, and the similarities were exactly as you would expect them to be.
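For completeness, the "find the most similar documents" lookup on the combined matrix can be sketched like this (toy data mine; the diagonal is masked because every document has similarity 1.0 with itself):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "peel the garlic cloves",
    "roast the lamb joint",
    "garlic and lamb roast",    # pretend this is the newly appended row
]
p = TfidfVectorizer().fit_transform(docs)
pairwise_similarity = (p @ p.T).toarray()

# similarities of the new (last) row to everything else
row = pairwise_similarity[-1].copy()
row[-1] = -1.0                  # mask self-similarity, which is always 1.0
print(row.argmax())             # index of the most similar existing document
```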