Tags: python, vector, scikit-learn, tf-idf, lsa

Transforming words into Latent Semantic Analysis (LSA) Vectors


Does anyone have suggestions for how to turn the words of a document into LSA vectors using Python and scikit-learn? I found two sites (here and here) that describe how to turn a whole document into an LSA vector, but I am interested in converting the individual words themselves.

The end goal is to sum the vectors of all the words in each sentence and then compare consecutive sentences to assess their semantic similarity.


Solution

  • Turning a sentence or a word into a vector is no different from doing so with a document: a sentence is just a short document, and a word is a very short one. The first link gives this code for mapping a document to a vector:

    def makeVector(self, wordString):
        """ @pre: unique(vectorIndex) """

        # Initialise the vector with 0's, one slot per vocabulary word
        vector = [0] * len(self.vectorKeywordIndex)
        wordList = self.parser.tokenise(wordString)
        wordList = self.parser.removeStopWords(wordList)
        for word in wordList:
            # Simple term-count model; words outside the vocabulary are skipped
            if word in self.vectorKeywordIndex:
                vector[self.vectorKeywordIndex[word]] += 1
        return vector
    

    The same function can map a sentence or a single word to a vector: just pass it in. For a single word, wordList is a list holding one value, something like ["word"], and the resulting vector is a unit vector with a 1 in the associated dimension and 0s elsewhere.

    Example:

    vectorKeywordIndex (representing all words in vocabulary):

    {"hello" : 0, "world" : 1, "this" : 2, "is" : 3, "me" : 4, "answer" : 5}
    

    document "this is me": [0, 0, 1, 1, 1, 0]

    document "hello answer me": [1, 0, 0, 0, 1, 1]

    word "hello": [1, 0, 0, 0, 0, 0]

    word "me": [0, 0, 0, 0, 1, 0]

    After that, similarity can be assessed with several measures, such as cosine similarity, using this code:

    from numpy import dot
    from numpy.linalg import norm

    def cosine(vector1, vector2):
        """ Relatedness of two vectors in the concept space:
            cosine = (V1 . V2) / (||V1|| * ||V2||) """
        return float(dot(vector1, vector2) / (norm(vector1) * norm(vector2)))
    

    or by using scikit-learn's sklearn.metrics.pairwise.cosine_similarity.

    from sklearn.metrics.pairwise import cosine_similarity
    # x and y must be 2-D (n_samples, n_features); wrap a single vector as [vector]
    sim = cosine_similarity(x, y)
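The vectors above are raw term counts. To get LSA vectors with scikit-learn, one option is to apply the same "a word is a very short document" idea after a truncated SVD. The following is only a minimal sketch under assumptions that are not from the original post: the toy `sentences` corpus, the choice of `CountVectorizer` with `TruncatedSVD`, `n_components=2`, and the helper names `word_vector` and `sentence_vector` are all illustrative.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus; each sentence is treated as a short document.
sentences = [
    "hello world this is me",
    "hello answer me",
    "this is the answer",
    "the world is big",
]

# Term-count model over the whole corpus (the vocabulary index).
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(sentences)

# LSA step: fit a truncated SVD on the term-count matrix.
# n_components=2 is an arbitrary choice for this tiny example.
svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(counts)

def word_vector(word):
    """Map one word to an LSA vector by treating it as a one-word document."""
    return svd.transform(vectorizer.transform([word]))[0]

def sentence_vector(sentence):
    """Sum the LSA vectors of the words in a sentence."""
    words = vectorizer.build_analyzer()(sentence)
    if not words:
        return np.zeros(svd.n_components)
    return np.sum([word_vector(w) for w in words], axis=0)

# Compare consecutive sentences by cosine similarity.
vectors = [sentence_vector(s) for s in sentences]
similarities = [
    float(cosine_similarity([a], [b])[0, 0])
    for a, b in zip(vectors[:-1], vectors[1:])
]
for (first, second), sim in zip(zip(sentences[:-1], sentences[1:]), similarities):
    print(f"{first!r} vs {second!r}: {sim:.3f}")
```

Because the SVD projection is linear, summing the word vectors of a sentence gives the same result as transforming the sentence's count vector directly, so `svd.transform(vectorizer.transform([sentence]))` is a shortcut. Words that are not in the fitted vocabulary map to the zero vector in this sketch.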