python gensim word-embedding glove

Load a subset of GloVe vectors with gensim


I have a word list like ['like', 'Python'] and I want to load the pre-trained GloVe word vectors for just these words. The GloVe file is too large to load in full. Is there a fast way to do this?

What I tried

I iterated over every line of the file, checked whether its word is in my list, and kept the line if so. This works, but it is a little slow.

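import pandas as pd

def readWordEmbeddingVector(Wrd):
    # Copy into a set: fast membership tests, and the caller's list is not mutated.
    wanted = set(Wrd)
    words = []
    with open('glove.twitter.27B/glove.twitter.27B.200d.txt', 'r', encoding='utf-8') as f:
        for line in f:
            vector = line.split()
            if vector and vector[0] in wanted:
                words.append(vector)
                wanted.remove(vector[0])
                if not wanted:  # every requested word has been found
                    break
    # Column 0 holds the word; the remaining columns are the vector components.
    words_vector = pd.DataFrame(words).set_index(0).astype('float')
    return words_vector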

I also tried the call below, but it loads the whole file instead of only the vectors I need:

gensim.models.keyedvectors.KeyedVectors.load_word2vec_format('word2vec.twitter.27B.200d.txt')

What I want

A method like gensim.models.keyedvectors.KeyedVectors.load_word2vec_format, but one that accepts a list of words to load.


Solution

  • There's no existing gensim support for filtering the words loaded via load_word2vec_format(). The closest is an optional limit parameter, which can be used to limit how many word-vectors are read (ignoring all subsequent vectors).

    You could conceivably create your own routine to perform such filtering, using the source code for load_word2vec_format() as a model. As a practical matter, you might have to read the file twice: first to find out exactly how many words in the file intersect with your set of words of interest (so you can allocate a right-sized array without trusting the declared size at the front of the file), then a second time to actually read the words of interest.
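One way to sketch the two-pass idea without touching gensim internals is to write the matching vectors to a smaller word2vec-format text file and load that instead. The helper below is hypothetical (filter_vectors and its parameters are not part of gensim). It assumes a space-separated text file with one word per line followed by its vector components, and it treats a first line consisting of two integers as a word2vec-style header to skip.

```python
def filter_vectors(src_path, dst_path, wanted_words):
    """Copy only the vectors for `wanted_words` into a word2vec-format text file.

    Hypothetical helper, not a gensim API. Returns the number of vectors written.
    """
    wanted = set(wanted_words)

    def matching_rows():
        with open(src_path, "r", encoding="utf-8") as src:
            for i, line in enumerate(src):
                parts = line.rstrip().split(" ")
                # Assumption: a first line of two integers is a "count dim" header.
                if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
                    continue
                if parts[0] in wanted:
                    yield parts

    # Pass 1: count the matches and record the vector dimensionality.
    count, dim = 0, 0
    for parts in matching_rows():
        count += 1
        dim = len(parts) - 1

    # Pass 2: write a header with the true count, then the matching rows.
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(f"{count} {dim}\n")
        for parts in matching_rows():
            dst.write(" ".join(parts) + "\n")
    return count
```

The resulting file should then be small enough to pass to KeyedVectors.load_word2vec_format(dst_path). The filtering still scans the full source file twice, so the saving is in memory and in later load times, not in the first run.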