Tags: nlp, fasttext, glove

Create word embeddings without keeping the fastText vector file in the repository


I am trying to embed sentences with InferSent, which uses fastText vectors for word embeddings. The fastText vector file is close to 5 GiB.

Keeping the fastText vector file alongside the code makes the repository huge and the code difficult to share and deploy (even building a Docker container becomes unwieldy).

Is there a way to avoid keeping the vector file in the repository while still reusing it to embed new sentences?


Solution

  • What kind of sentences are you embedding? Are they from the same domain as the corpus on which the fastText embeddings were trained?

    Build a token-level representation of your data, i.e. the set of all tokens (or at least the most frequent tokens) that appear in the sentences you want to embed.

    Compute the overlap between your tokens and the fastText vocabulary, and drop every fastText vector whose token does not appear in your data; a sketch of this step follows below.

    I did that recently and went from a 1.4 GB file of pre-trained word embeddings down to 200 MB, mainly because the overlap with my corpus was only around 10%.
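
    As a minimal sketch of that filtering step: this assumes the embeddings are in the plain-text .vec format (a header line with the vector count and dimension, then one token per line followed by its values). The file names and the whitespace tokenization below are placeholders; swap in your own corpus, vector file, and tokenizer.

        def build_vocab(corpus_path):
            """Collect the set of whitespace-separated tokens in the corpus."""
            vocab = set()
            with open(corpus_path, encoding="utf-8") as f:
                for line in f:
                    vocab.update(line.split())
            return vocab

        def filter_vectors(vec_in, vec_out, vocab):
            """Keep only the vectors whose token is in vocab; rewrite the header."""
            kept = []
            with open(vec_in, encoding="utf-8") as f:
                _, dim = f.readline().split()  # header: "<num_vectors> <dim>"
                for line in f:
                    if line.split(" ", 1)[0] in vocab:
                        kept.append(line)
            with open(vec_out, "w", encoding="utf-8") as f:
                f.write(f"{len(kept)} {dim}\n")  # new header with the reduced count
                f.writelines(kept)
            return len(kept)

        # Hypothetical file names; my_corpus.txt holds one sentence per line.
        vocab = build_vocab("my_corpus.txt")
        n = filter_vectors("crawl-300d-2M.vec", "filtered.vec", vocab)
        print(f"Kept {n} vectors")

    Since the output keeps the same .vec format, it works as a drop-in replacement wherever the full file was loaded.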