python machine-learning gensim word-embedding phrase

Is there a pretrained Gensim phrase model?


Is there a pretrained Gensim Phrases model? If not, would it be possible to reverse-engineer one from a pretrained word embedding?

I am trying to use GoogleNews-vectors-negative300.bin with Gensim's Word2Vec. First, I need to map my words into phrases so that I can look up their vectors in Google's pretrained embedding.

I searched the official Gensim documentation but could not find any info. Thanks!


Solution

  • I'm not aware of anyone sharing a Phrases model. Any such model would be very sensitive to the preprocessing/tokenization step, and to the specific parameters, that its creator used.

    Other than the high-level algorithm description, Google's exact choices for the tokenization/canonicalization/phrase-combination applied to the data behind the GoogleNews 2013 word-vectors don't appear to be documented anywhere. Some guesses about the preprocessing can be made by reviewing the tokens present, but I'm unaware of any code that applies similar choices to other text.

    You could try to mimic their unigram tokenization, then speculatively combine runs of unigrams into ever-longer multigrams up to some maximum, check whether those combinations are present, and when they're not, revert to the unigrams (or the largest combination present). This might be expensive if done naively, but would be amenable to optimization if really important, especially for some subset of the more-frequent words, as the GoogleNews set appears to obey the convention of listing words in descending frequency.
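    A minimal sketch of that greedy longest-match combination, assuming `vocab` is the set of keys from the pretrained model (e.g. `set(kv.key_to_index)` after loading the file with `gensim.models.KeyedVectors.load_word2vec_format(..., binary=True)`), and assuming the GoogleNews convention of joining phrase components with `_`. The tiny hand-picked `vocab` below is only for illustration:

    ```python
    def combine_phrases(tokens, vocab, max_n=3, sep="_"):
        """Greedily replace runs of unigrams with the longest multigram
        present in vocab, falling back to the bare unigram if none match."""
        out, i = [], 0
        while i < len(tokens):
            # Try the longest candidate first, shrinking toward the unigram.
            for n in range(min(max_n, len(tokens) - i), 0, -1):
                candidate = sep.join(tokens[i:i + n])
                if candidate in vocab or n == 1:
                    out.append(candidate)
                    i += n
                    break
        return out

    # Stand-in vocabulary; in practice use set(kv.key_to_index).
    vocab = {"New_York", "New_York_City", "machine_learning", "is", "fun"}
    print(combine_phrases(["New", "York", "City", "is", "fun"], vocab))
    # -> ['New_York_City', 'is', 'fun']
    ```

    Set membership is O(1), so the dominant cost is the `max_n` candidate joins per position; restricting candidates to frequent words (the early entries of the vector file) is one of the optimizations mentioned above.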

    (In general, though it's a quick & easy starting set of word-vectors, I think GoogleNews is a bit over-relied upon. It will lack words/phrases and new senses that have developed since 2013, and any meanings it does capture were determined by news articles in the years leading up to 2013... which may not match the dominant senses of words in other domains. If your domain isn't specifically news, and you have sufficient data, choosing your own domain-specific tokenization/combination will likely work better.)