python word2vec spacy

Update spaCy Vocabulary


I was wondering if it is possible to update spaCy's default vocabulary. What I am trying to do is this:

But since a lot of words in my corpus aren't in spaCy's default vocabulary, I can't make use of the imported vectors. Is there an (easy) way to add those missing types?

Edit:
I realize it might be problematic to mix vectors. So my question is:
How can I import a custom vocabulary into spaCy?


Solution

  • This is much easier in the next version, which should be out this week; I'm just finishing testing it. For now:

    • By default spaCy loads a data/vocab/vec.bin file, where the "data" directory is within the spacy.en module directory.
    • Create the vec.bin file from a bz2 file using spacy.vocab.write_binary_vectors.
    • Either replace spaCy's vec.bin file, or call nlp.vocab.load_rep_vectors at run-time, with the path to the binary file.

    The above is a bit inconvenient at first, but the binary file format is much smaller and faster to load, and the vectors files are fairly big. Note that GloVe distributes in gzip format, not bzip.
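
    Since GloVe ships gzip files while the conversion step above expects bz2, the download has to be recompressed first. A minimal sketch using only the standard library (Python 3); the helper name gzip_to_bz2 is made up for illustration and is not part of spaCy:

```python
import bz2
import gzip
import shutil


def gzip_to_bz2(gz_loc, bz2_loc):
    """Recompress a gzip file as bz2 (hypothetical helper, not a spaCy function)."""
    # Stream in chunks so a large vectors file is never held in memory at once.
    with gzip.open(gz_loc, 'rb') as src, bz2.open(bz2_loc, 'wb') as dst:
        shutil.copyfileobj(src, dst)
```

    The resulting bz2 path is what you would then hand to write_binary_vectors.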

    Out of interest: are you using the GloVe vectors, or something you trained on your own data? If your own data, did you use Gensim? I'd like to make this much easier, so I'd appreciate suggestions for what work-flow you'd like to see.
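
    If the vectors come from your own training run, they first need to be written out as a bz2 text file. A rough sketch, assuming the converter accepts the same plain-text layout GloVe uses (one word per line followed by its space-separated values); that layout is an assumption here, and write_vectors_bz2 is a made-up helper name, not a spaCy or Gensim function:

```python
import bz2


def write_vectors_bz2(vectors, bz2_loc):
    """Write {word: sequence of floats} as bz2-compressed text.

    Assumed layout (GloVe-style): "word v1 v2 ... vn", one word per line.
    Words containing whitespace would break this layout and are not handled.
    """
    with bz2.open(bz2_loc, 'wt', encoding='utf8') as file_:
        for word, vector in vectors.items():
            values = ' '.join('%.6f' % float(value) for value in vector)
            file_.write('%s %s\n' % (word, values))
```

    Any mapping from words to numeric sequences works as input, so vectors exported from another toolkit can be passed in as a plain dict.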

    Load new vectors at run-time, optionally converting them

        import spacy.vocab

        def set_spacy_vectors(nlp, binary_loc, bz2_loc=None):
            # If a bz2 text file is given, convert it to the binary format first.
            if bz2_loc is not None:
                spacy.vocab.write_binary_vectors(bz2_loc, binary_loc)

            # Load the binary vectors into the running pipeline's vocabulary.
            nlp.vocab.load_rep_vectors(binary_loc)
    

    Replace the vec.bin, so your vectors will be loaded by default

        from os import path

        import plac
        import spacy.en
        from spacy.vocab import write_binary_vectors

        def main(bz2_loc):
            # Overwrite the vec.bin that spaCy loads by default.
            bin_loc = path.join(path.dirname(spacy.en.__file__), 'data', 'vocab', 'vec.bin')
            write_binary_vectors(bz2_loc, bin_loc)

        if __name__ == '__main__':
            # plac turns main's arguments into command-line arguments.
            plac.call(main)