nlp · deep-learning · pytorch · word-embedding · torchtext

Use pretrained embedding in Spanish with Torchtext


I am using Torchtext in an NLP project. I have a pretrained embedding on my system that I'd like to use, so I tried:

my_field.vocab.load_vectors(my_path)

But apparently this only accepts names from a short list of pre-approved embeddings. In particular, I get this error:

Got string input vector "my_path", but allowed pretrained vectors are ['charngram.100d', 'fasttext.en.300d', ..., 'glove.6B.300d']

I found some people with similar problems, but the only solutions I have found so far amount to "change Torchtext's source code", which I would rather avoid if at all possible.

Is there any other way I can work with my pretrained embedding? A solution that lets me use a different pretrained Spanish embedding would also be acceptable.

Some people seem to think it is not clear what I am asking. So, if the title and final question are not enough: "I need help using a pre-trained Spanish word-embedding in Torchtext".


Solution

  • It turns out there is a relatively simple way to do this without changing Torchtext's source code. The inspiration comes from this GitHub thread.

    1. Create numpy word-vector tensor

    You need to load your embedding so you end up with a numpy array with dimensions (number_of_words, word_vector_length):

    my_vecs_array[word_index] should return your corresponding word vector.

    IMPORTANT. The indices (word_index) for this array MUST be taken from Torchtext's word-to-index dictionary (field.vocab.stoi). Otherwise Torchtext will point to the wrong vectors!
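
    For example, a minimal sketch of this loading step, assuming the embedding is stored as a plain-text file in word2vec/fastText format (one word per line followed by its vector; the path my_path/my_embeddings.vec and the word_to_vec dictionary are just illustrative names, not part of the original recipe):

    import numpy as np

    # Read the embedding file into a {word: vector} dictionary.
    word_to_vec = {}
    with open('my_path/my_embeddings.vec', encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            if len(parts) == 2:  # some formats start with a "count dimension" header line
                continue
            word_to_vec[parts[0]] = np.array(parts[1:], dtype=np.float32)

    word_vector_length = len(next(iter(word_to_vec.values())))

    # Build the array row by row, indexed by Torchtext's own word-to-index mapping,
    # so that my_vecs_array[word_index] is the vector for my_field.vocab.itos[word_index].
    my_vecs_array = np.zeros((len(my_field.vocab), word_vector_length), dtype=np.float32)
    for word, word_index in my_field.vocab.stoi.items():
        if word in word_to_vec:
            my_vecs_array[word_index] = word_to_vec[word]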

    Don't forget to convert to tensor:

    my_vecs_tensor = torch.from_numpy(my_vecs_array)
    

    2. Load array to Torchtext

    I don't think this step is strictly necessary because of the next one, but it lets you keep both the dictionary and the vectors together in the Torchtext field.

    my_field.vocab.set_vectors(my_field.vocab.stoi, my_vecs_tensor, word_vector_length)
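
    As a quick sanity check (using 'casa' as an arbitrary example word, and the word_to_vec dictionary from the sketch in step 1), you can confirm the rows ended up at the right indices:

    example_word = 'casa'  # any word present in both the vocabulary and the embedding file
    idx = my_field.vocab.stoi[example_word]
    print(my_field.vocab.vectors[idx][:5])  # row stored by Torchtext
    print(word_to_vec[example_word][:5])    # row from the original file; should match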
    

    3. Pass weights to model

    In your model you will declare the embedding like this:

    my_embedding = torch.nn.Embedding(vocab_len, word_vect_len)
    

    Then you can load your weights using:

    my_embedding.weight = torch.nn.Parameter(my_field.vocab.vectors, requires_grad=False)
    

    Use requires_grad=True if you want to train (fine-tune) the embedding, and requires_grad=False if you want to keep it frozen.
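
    Putting steps 1-3 together, a minimal end-to-end sketch might look like this (the class name, the LSTM layer, and the hidden size are just illustrative choices, not part of the original recipe):

    import torch

    class MyModel(torch.nn.Module):
        def __init__(self, field, hidden_size=128):
            super().__init__()
            vocab_len = len(field.vocab)
            word_vect_len = field.vocab.vectors.shape[1]
            # Declare the embedding, then overwrite its weights with the pretrained vectors.
            self.my_embedding = torch.nn.Embedding(vocab_len, word_vect_len)
            self.my_embedding.weight = torch.nn.Parameter(field.vocab.vectors,
                                                          requires_grad=False)
            self.rnn = torch.nn.LSTM(word_vect_len, hidden_size, batch_first=True)

        def forward(self, token_indices):
            embedded = self.my_embedding(token_indices)  # (batch, seq_len, word_vect_len)
            output, _ = self.rnn(embedded)
            return output

    model = MyModel(my_field)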

    EDIT: It turns out there is another way that is a bit easier! The improvement is that you can apparently pass the pretrained word vectors directly during the vocabulary-building step, which takes care of steps 1-2 here.
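
    For reference, a sketch of that shorter route, assuming the same hypothetical my_path/my_embeddings.vec file as above and a Torchtext version that provides torchtext.vocab.Vectors (which can read an arbitrary word2vec/GloVe-style text file and cache it):

    from torchtext.vocab import Vectors

    # Point Vectors at your own file instead of one of the pre-approved names.
    my_vectors = Vectors(name='my_embeddings.vec', cache='my_path')

    # build_vocab accepts a `vectors` argument, so the word-to-index mapping and the
    # vector table are created together and stay aligned (my_train_dataset is whatever
    # Dataset you build your vocabulary from).
    my_field.build_vocab(my_train_dataset, vectors=my_vectors)

    # my_field.vocab.vectors is now populated and can be passed to the model as in step 3.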