I am trying to load GloVe 100d embeddings into the spaCy NLP pipeline.
I create the vocabulary in spaCy format as follows:
python -m spacy init-model en spacy.glove.model --vectors-loc glove.6B.100d.txt
glove.6B.100d.txt is converted to word2vec format by adding "400000 100" as the first line.
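For reference, a minimal sketch of that conversion step might look like this (writing to a separate file glove.6B.100d.w2v.txt is just my assumption, to keep the original intact; the init-model command would then point at that file):

# Prepend the word2vec-style header "<n_vectors> <n_dims>" to the raw GloVe file
with open("glove.6B.100d.txt", encoding="utf-8") as f_in:
    lines = f_in.readlines()
with open("glove.6B.100d.w2v.txt", "w", encoding="utf-8") as f_out:
    f_out.write(f"{len(lines)} 100\n")  # 400000 vectors, 100 dimensions each
    f_out.writelines(lines)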
Now spacy.glove.model/vocab has the following files:
5468549 key2row
38430528 lexemes.bin
5485216 strings.json
160000128 vectors
In the code:
import spacy
from spacy.vocab import Vocab

nlp = spacy.load("en_core_web_md")
vocab = Vocab().from_disk('./spacy.glove.model/vocab')
nlp.vocab = vocab
print(len(nlp.vocab.strings))
print(nlp.vocab.vectors.shape)

this gives

407174
(400000, 100)
However, the problem is that

V = nlp.vocab
max_rank = max(lex.rank for lex in V if lex.has_vector)
print(max_rank)

gives 0.
I just want to use the 100d GloVe embeddings within spaCy in combination with the "tagger", "parser", and "ner" models from en_core_web_md.
Does anyone know how to go about doing this correctly (is this possible)?
The tagger/parser/ner models are trained with the included word vectors as features, so if you replace them with different vectors, you are going to break all of those components.
You can use the new vectors to train a new model, but replacing the vectors in a model whose components are already trained is not going to work well: the tagger/parser/ner components will most likely produce nonsense results.
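If you do retrain, then with spaCy v2.x (which the init-model command above suggests you are using) the GloVe vectors model can be passed to the train CLI via --vectors; the output directory and the JSON train/dev files below are placeholders for your own data:

python -m spacy train en ./glove_model ./train.json ./dev.json --vectors ./spacy.glove.model --pipeline tagger,parser,ner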
If you want 100d vectors instead of 300d vectors to save space, you can resize the vectors, which truncates each entry to its first 100 dimensions. The performance will go down a bit as a result.
import spacy
nlp = spacy.load("en_core_web_md")
assert nlp.vocab.vectors.shape == (20000, 300)
nlp.vocab.vectors.resize((20000, 100))
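If you want to keep the smaller pipeline around, you can then save it and load it back by path later (the directory name here is just a placeholder):

nlp.to_disk("./en_core_web_md_100d")
# later:
nlp = spacy.load("./en_core_web_md_100d")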