python nlp gensim unsupervised-learning fasttext

word-embedding: Convert supervised model into unsupervised model


I want to load a pre-trained embedding to initialize my own unsupervised FastText model and retrain it with my dataset.

The trained embedding file I have loads fine with gensim.models.KeyedVectors.load_word2vec_format('model.txt'). But when I try FastText.load_fasttext_format('model.txt'), I get: NotImplementedError: Supervised fastText models are not supported.
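
For clarity, this is roughly what I'm running ('model.txt' is my embedding file):

    from gensim.models import FastText, KeyedVectors

    # this works: the file loads as a plain set of word-vectors
    kv = KeyedVectors.load_word2vec_format('model.txt')

    # this raises: NotImplementedError: Supervised fastText models are not supported.
    model = FastText.load_fasttext_format('model.txt')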

Is there any way to convert supervised KeyedVectors to unsupervised FastText? And if possible, is it a bad idea?

I know there is a great difference between supervised and unsupervised models, but I really want to try to use/convert this one and retrain it. I can't find a trained unsupervised model to load for my case (it's a Portuguese dataset), and the best model I've found is that one.


Solution

  • If your model.txt file loads OK with KeyedVectors.load_word2vec_format('model.txt'), then that's just a simple set of word-vectors. (That is, not a 'supervised' model.)

    However, Gensim's FastText doesn't support preloading a simple set of vectors for further training - for continued training, it needs a full FastText model, either from Facebook's binary format or from a prior Gensim FastText model's .save().
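
    For example, the two supported routes look roughly like this (the filenames and new_sentences corpus are placeholders; load_facebook_model() is the newer helper for Facebook's .bin format in recent Gensim versions):

        from gensim.models import FastText
        from gensim.models.fasttext import load_facebook_model

        # Route 1: a full model in Facebook's binary (.bin) format
        ft = load_facebook_model('cc.pt.300.bin')

        # Route 2: a model previously saved by Gensim's own FastText .save()
        # ft = FastText.load('my_fasttext.model')

        # either way, continued training on new data is then possible
        ft.build_vocab(new_sentences, update=True)
        ft.train(new_sentences, total_examples=ft.corpus_count, epochs=ft.epochs)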

    (That trying to load a plain-vectors file generates that error suggests the load_fasttext_format() method is mis-interpreting it as some other kind of binary FastText model it doesn't support.)

    Update after comment below:

    Of course you can mutate a model however you like, including ways not officially supported by Gensim. Whether that's helpful is another matter.

    You can create an FT model with a compatible/overlapping vocabulary, load the old word-vectors separately, then copy each prior vector over to replace the corresponding (randomly-initialized) vectors in the new model. (Note that the property that affects further training is actually the ftModel.wv.vectors_vocab trained-up full-word vectors, not .vectors, which is composited from full words & ngrams.)
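
    A minimal sketch of that copy-over, assuming Gensim 4.x names (vector_size, key_to_index/index_to_key; older versions use size, wv.vocab/index2word) and a sentences iterable holding your own training corpus:

        from gensim.models import FastText, KeyedVectors

        # the plain word-vectors that already load fine
        kv = KeyedVectors.load_word2vec_format('model.txt')

        # fresh FastText model with matching dimensionality; other parameters are guesses
        ft = FastText(vector_size=kv.vector_size, window=5, min_count=1)

        # initialize the vocabulary (and random vectors) from your own corpus
        ft.build_vocab(sentences)

        # overwrite the randomly-initialized full-word vectors with the prior ones
        for word in ft.wv.index_to_key:
            if word in kv:
                ft.wv.vectors_vocab[ft.wv.key_to_index[word]] = kv[word]

        # then continue training on your corpus
        ft.train(sentences, total_examples=ft.corpus_count, epochs=ft.epochs)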

    But the tradeoffs of such an ad-hoc strategy are many: the ngrams would still start random, and taking some prior model's just-word vectors isn't quite the same as a FastText model's full-word vectors, which are trained to be later mixed with ngrams.

    You'd want to make sure your new model's sense of word-frequencies is meaningful, as those affect further training - but that data isn't usually available with a plain-text prior word-vector set. (You could plausibly synthesize a good-enough set of frequencies by assuming a Zipf distribution.)
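
    One rough way to synthesize such frequencies, assuming the word-vector file lists words in roughly descending frequency order (as word2vec-format files usually do) and an arbitrary made-up corpus size, is something like:

        # kv is the loaded KeyedVectors; build Zipf-ish counts from each word's rank
        assumed_total = 10_000_000   # arbitrary assumed corpus size
        zipf_freqs = {
            word: max(1, int(assumed_total / (rank + 1)))
            for rank, word in enumerate(kv.index_to_key)   # Gensim 4.x; older: kv.index2word
        }

        # one possible way to feed these counts into a fresh model instead of build_vocab()
        ft.build_vocab_from_freq(zipf_freqs)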

    Your further training might get a "running start" from such initialization - but that wouldn't necessarily mean the end-vectors remain comparable to the starting ones. (All positions may be arbitrarily changed by the volume of newer training, progressively diluting away most of the prior influence.)

    So: you'd be in an improvised/experimental setup, somewhat far from usual FastText practices, where you'd want to re-verify lots of assumptions and rigorously evaluate whether those extra steps/approximations are actually improving things.