
Using a Word2Vec model pre-trained on Wikipedia


I need to use gensim to get vector representations of words, and I figure the best thing to use would be a word2vec model that's pre-trained on the English Wikipedia corpus. Does anyone know where to download it, how to install it, and how to use gensim to create the vectors?


Solution

  • @imanzabet provided useful links to pre-trained vectors, but if you want to train the models yourself using gensim, then you need to do two things:

    1. Acquire the Wikipedia data, which you can access here. It looks like the most recent snapshot of English Wikipedia was taken on the 20th, and it can be found here. I believe the other English-language "wikis" (e.g. Wikiquote) are captured separately, so if you want to include them you'll need to download those as well.

    2. Load the data and use it to generate the models. That's a fairly broad question, so I'll just link you to the excellent gensim documentation and word2vec tutorial.
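
To make the two steps above a little more concrete, here is a minimal sketch of how they might fit together. It is not taken from the linked tutorial. It assumes gensim 4.x (where the dimensionality parameter is called `vector_size`; gensim 3.x called it `size`) and a compressed dump that has already been downloaded. The names `WikiSentences` and `train_word2vec` and the hyperparameter values are illustrative choices, not recommendations.

```python
from typing import Iterator, List


class WikiSentences:
    """Restartable iterable over the tokenized articles of a Wikipedia dump.

    Word2Vec makes several passes over its input (one to build the
    vocabulary, then one per training epoch), so a one-shot generator is
    not enough. Each call to __iter__ starts a fresh pass over the dump.
    """

    def __init__(self, dump_path: str) -> None:
        # Path to a *.xml.bz2 dump file (assumed to exist locally).
        self.dump_path = dump_path

    def __iter__(self) -> Iterator[List[str]]:
        # Imported lazily so the class can be defined without gensim installed.
        from gensim.corpora import WikiCorpus

        # dictionary={} skips building a bag-of-words dictionary, which
        # Word2Vec does not need and which would cost an extra full pass.
        wiki = WikiCorpus(self.dump_path, dictionary={})
        # get_texts() yields one list of string tokens per article.
        yield from wiki.get_texts()


def train_word2vec(dump_path: str, model_path: str):
    """Train a Word2Vec model on a Wikipedia dump and save it to disk."""
    from gensim.models import Word2Vec

    model = Word2Vec(
        sentences=WikiSentences(dump_path),
        vector_size=100,  # dimensionality of the word vectors (illustrative)
        window=5,         # context window size (illustrative)
        min_count=5,      # ignore words seen fewer than 5 times
        workers=4,        # number of training threads
    )
    model.save(model_path)
    return model
```

Assuming the dump was saved as `enwiki-latest-pages-articles.xml.bz2` (an assumed file name, adjust it to whatever you downloaded), calling `model = train_word2vec("enwiki-latest-pages-articles.xml.bz2", "enwiki.word2vec.model")` trains and saves the model, after which `model.wv["computer"]` returns the vector for a word in the vocabulary. Expect training on the full English dump to take hours.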