spacy, vocabulary, nightly-build, vector-space

spaCy nightly (3.0.0rc): spacy.load no longer takes vocab, how do I add a word2vec vector space?


In spaCy 2 I used this to pass the vocab of a standard model to an empty spaCy model with a vector space (created with spacy init):

nlp3 = spacy.load('nl_core_news_sm')  # standard model without vectors
spacy.load("spacyinitnlmodelwithvectorspace", vocab=nlp3.vocab)

In the spaCy nightly version 3.0.0rc, spacy.load no longer accepts the vocab parameter. Does anyone have a suggestion for how I can add a vocab to a spaCy model?


Solution

  • This works. It is adapted from "Export vectors from fastText to spaCy" and adds the vectors from a .vec file to a spaCy model. I have only tested it on a small dataset.

    from __future__ import unicode_literals

    import numpy
    import spacy


    def spacy_load_vec(spacy_model, vec_file, spacy_vec_model, print_words=False):
        """Add the vectors from a .vec file to a spaCy model that has none.

        Adapted from "Export vectors from fastText to spaCy".

        Parameters
        ----------
        spacy_model : str
            spaCy model without a vector space.
        vec_file : str
            .vec file with vectors trained by fastText or word2vec.
        spacy_vec_model : str
            Output path for the spaCy model with the vector space.
        print_words : bool, optional
            Whether to print each word as it is added. The default is False.

        Returns
        -------
        None.

        """
        nlp = spacy.load(spacy_model)
        with open(vec_file, 'rb') as file_:
            # The first line of the .vec file holds the row count and the vector width.
            header = file_.readline()
            nr_row, nr_dim = header.split()
            nlp.vocab.reset_vectors(width=int(nr_dim))
            count = 0
            for line in file_:
                count += 1
                line = line.rstrip().decode('utf8')
                # Split from the right so the word stays in one piece.
                pieces = line.rsplit(' ', int(nr_dim))
                word = pieces[0]
                if print_words:
                    print("{} - {}".format(count, word))
                vector = numpy.asarray([float(v) for v in pieces[1:]], dtype='f')
                nlp.vocab.set_vector(word, vector)  # add the vectors to the vocab
        nlp.to_disk(spacy_vec_model)
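
To make the file handling above easier to follow, here is a minimal sketch of just the parsing step, without spaCy or numpy. The helper name `read_vec` and the two-word sample are hypothetical and only for illustration. It assumes the usual .vec text layout: a header line with the row count and the vector width, then one word per line followed by its values.

```python
import io


def read_vec(stream):
    """Parse a binary .vec stream into (width, {word: [floats]}). Hypothetical helper."""
    # Header line: "<number of rows> <vector width>".
    nr_row, nr_dim = (int(n) for n in stream.readline().split())
    vectors = {}
    for line in stream:
        line = line.rstrip().decode("utf8")
        # Split from the right, at most nr_dim times, so a token that
        # itself contains a space is kept whole as pieces[0].
        pieces = line.rsplit(" ", nr_dim)
        vectors[pieces[0]] = [float(v) for v in pieces[1:]]
    return nr_dim, vectors


# Made-up sample data; real files come from fastText or word2vec.
sample = io.BytesIO(b"2 3\nhuis 0.1 0.2 0.3\nnew york 1.0 2.0 3.0\n")
nr_dim, vectors = read_vec(sample)
print(nr_dim, sorted(vectors))
```

Splitting from the right with a limit of `nr_dim` is what lets the function cope with entries whose word part contains spaces; a plain `split(' ')` would break such a word into several pieces.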