python, nlp, gensim, fasttext

Why isn't my Gensim fastText model continuing to train on a new corpus?


I am trying to continue training a fastText model with Gensim, using my own corpus of text.

I've followed along with the documentation here: https://radimrehurek.com/gensim/models/fasttext.html

And I have written the following code:

First, create a small corpus:

corpus = [
    "The brown dog jumps over the kangaroo",
    "I want to ride my bicycle to Mount Everest",
    "What a lovely day it is",
    "When I Wagagamagga, everybody stops to listen"
]

corpus = [sentence.split() for sentence in corpus]

And then load a testing model:

from gensim.models.fasttext import load_facebook_model
from gensim.test.utils import datapath

model = load_facebook_model(datapath("crime-and-punishment.bin"))

Then I do a check to see if the model knows my weird new word in the corpus:

'Wagagamagga' in model.wv.key_to_index

Which returns False.

Then I try to continue the training:

model.build_vocab(corpus, update=True)
model.train(corpus, total_examples=len(corpus), epochs=model.epochs)

The model should know about my weird new word now, but this check still returns False when I am expecting it to return True:

'Wagagamagga' in model.wv.key_to_index

What have I missed?


Solution

  • Models generally have a min_count value of at least 5, meaning words with fewer occurrences are ignored. Discarding the rarest words typically improves model quality, for two reasons:

    1. such rare words have too few usage examples to get good vectors themselves; and further…
    2. by pushing surrounding words outside each other's windows, and by spending training cycles and internal-weight updates on vectors that still won't be good, they make other word-vectors worse

    With larger training data, increasing the min_count even higher makes sense.
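    For example, a hedged sketch of setting that threshold when training a fresh model on a big corpus (big_corpus is a hypothetical list of tokenized sentences; the parameter values are illustrative, not recommendations):

        from gensim.models import FastText

        # With plentiful data, a higher min_count discards more of the rare
        # noise-words that would otherwise get low-quality vectors.
        big_model = FastText(sentences=big_corpus, vector_size=100, min_count=10, epochs=5)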

    So, your problem is likely that a single occurrence of that word is insufficient to make it a tracked word. The best fix would be a larger, varied corpus with multiple contrasting usage examples, at least as many as the model.min_count value, as in the sketch below.
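    As a minimal sketch of that fix, reusing the model and corpus from the question: repeating the sentence containing the rare word until its count reaches model.min_count lets it survive the vocabulary update. (Note the sketch supplies the token without the trailing comma that a plain .split() leaves attached; otherwise the vocabulary key would be 'Wagagamagga,'.)

        # A hedged sketch, not part of the original answer: boost the rare
        # word's count past the model's min_count threshold before updating.
        rare_sentence = ["When", "I", "Wagagamagga", "everybody", "stops", "to", "listen"]
        boosted_corpus = corpus + [rare_sentence] * model.min_count

        model.build_vocab(boosted_corpus, update=True)
        model.train(boosted_corpus, total_examples=len(boosted_corpus), epochs=model.epochs)

        print('Wagagamagga' in model.wv.key_to_index)  # expected True once count >= min_count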

    Separately: note that it is always better to train a model with all data at the same time.

    Incremental updates will run, but they introduce thorny issues of relative balance between older and newer training sessions. To the extent that a new session uses only a subset of words and representative word-usages, the words included in the update can be nudged by training out of comparable alignment with words known only from earlier sessions.

    So if trying incremental updates, make sure your quality-evaluations are strong enough to detect whether the model is actually improving, or getting worse, on your real goals.
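    As a rough sketch of such a check (not from the original answer; the probe word is an arbitrary assumption and should be any word the loaded model already knows):

        # Record the nearest neighbours of a probe word before the update...
        probe = 'man'  # hypothetical probe; pick any word in model.wv.key_to_index
        before = model.wv.most_similar(probe, topn=5)

        model.build_vocab(corpus, update=True)
        model.train(corpus, total_examples=len(corpus), epochs=model.epochs)

        # ...and compare afterwards: large unexplained shifts suggest the new
        # session has pulled older vectors out of alignment with the rest.
        after = model.wv.most_similar(probe, topn=5)
        print('before:', before)
        print('after: ', after)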