hash, nlp, embedding, n-gram, fasttext

Does hashing in fastText lead to different n-grams sharing the same embedding?


As per Section 3.2 of the original fastText paper, the authors state:

"In order to bound the memory requirements of our model, we use a hashing function that maps n-grams to integers in 1 to K."

Does this mean the model learns only K n-gram embeddings regardless of the number of distinct n-grams extracted from the training corpus, and that if two different n-grams collide when hashed, they share the same embedding?

Thanks.


Solution

  • Yes: the character n-grams are hashed into a collision-oblivious table of vectors, so some slots end up being trained by more than one n-gram.

    In practice, the most saliently meaningful n-grams tend to dominate their slot, and collisions don't do too much damage because many n-gram vectors are combined per word. And the vectors synthesized from n-grams for out-of-vocabulary (OOV) words remain better than nothing. (A minimal sketch of the bucketing is shown after this answer.)

    You can change the number of buckets at model-creation time via the bucket parameter (see the usage example below), if your use case differs significantly (especially in overall training corpus size) from what drove the choice of the default. (The default is 2 million bucket slots.)

    You could potentially use fewer buckets to save memory – but then you'd have more collisions, and likely lower-quality OOV vectors.
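For intuition, here is a minimal sketch of the bucketing idea, using an FNV-1a-style hash similar to the one in fastText's source. The constants, the helper names, and the deliberately tiny bucket count are illustrative, not fastText's exact implementation:

    # Minimal sketch of fastText-style n-gram bucketing (illustrative only).

    def ngram_hash(s: str) -> int:
        """FNV-1a-style 32-bit hash over the bytes of an n-gram."""
        h = 2166136261
        for b in s.encode("utf-8"):
            h = ((h ^ b) * 16777619) & 0xFFFFFFFF  # keep it 32-bit
        return h

    BUCKETS = 10  # deliberately tiny; the real default is on the order of millions

    def bucket_of(ngram: str) -> int:
        return ngram_hash(ngram) % BUCKETS

    # Character 3-grams of "<where>" and "<word>" (with boundary markers),
    # the way fastText decomposes words into subword units.
    ngrams = ["<wh", "whe", "her", "ere", "re>", "<wo", "wor", "ord", "rd>"]
    for ng in ngrams:
        print(f"{ng} -> bucket {bucket_of(ng)}")

    # With only 10 buckets, some of these n-grams land in the same bucket and
    # would therefore share a single embedding vector; a word's vector is the
    # sum of its n-gram (and word) vectors, which dilutes any one collision.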
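If you train through Gensim's FastText wrapper (4.x), the bucket count is exposed as the bucket argument; the fastText command-line tool exposes the same setting as -bucket. A small example with a hypothetical toy corpus, just to show where the parameter goes:

    from gensim.models import FastText

    # Hypothetical toy corpus, only to show where the parameter goes.
    sentences = [["hello", "world"], ["hashing", "character", "ngrams"]]

    # bucket sets the number of hash slots shared by all character n-grams;
    # fewer buckets save memory at the cost of more collisions.
    model = FastText(sentences, vector_size=50, min_count=1, bucket=200_000)

    print(model.wv["hello"].shape)  # in-vocabulary word
    print(model.wv["helo"].shape)   # OOV word, built from bucketed n-grams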