machine-learning · nlp · word2vec · doc2vec

Build vocab in doc2vec


I have a list of about 500 abstracts and articles in a CSV file; each paragraph contains roughly 800 to 1000 words. Whenever I build the vocab and print it, it gives None. Why is that, and how can I improve the results?

    import string
    import gensim
    from nltk.tokenize import word_tokenize

    # strip punctuation and tokenize one document
    lst_doc = doc.translate(str.maketrans('', '', string.punctuation))
    target_data = word_tokenize(lst_doc)

    train_data = list(read_data())

    model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)

    train_vocab = model.build_vocab(train_data)
    print(train_vocab)

    train = model.train(train_data, total_examples=model.corpus_count,
                        epochs=model.epochs)

Output: None


Solution

  • A call to build_vocab() only builds the vocabulary inside the model, for later use. That method doesn't return anything, so your train_vocab variable will be Python's None.

    So the behavior you're seeing is expected. If you're stuck, say more about what your ultimate aims are, and what you'd want to see as steps toward those aims.
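    For reference, a minimal sketch of the usual call pattern (assuming train_data is already a list of TaggedDocument objects, as in your code) never captures a return value from build_vocab() or train():

        import gensim

        model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)

        # both calls update the model in place and return None
        model.build_vocab(train_data)
        model.train(train_data, total_examples=model.corpus_count,
                    epochs=model.epochs)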

    If you want to see reporting of the progress of your calls to build_vocab() or train(), you can set the logging level to INFO. This is usually a good idea when working to learn a new library: even if the copious info shown is initially hard to understand, by reviewing it you'll start to see the various internal steps, and internal counts/timings/etc, that hint whether things are going well or poorly.
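    For example, a minimal sketch using Python's standard logging module, enabled before the build/train calls:

        import logging

        # surface Gensim's INFO-level progress messages (vocab scan, epoch timings, etc.)
        logging.basicConfig(
            format='%(asctime)s : %(levelname)s : %(message)s',
            level=logging.INFO,
        )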

    You can also examine the state of the model and its various internal properties after the code has run.

    For example, after build_vocab() the model.wv property contains a Gensim KeyedVectors structure holding all the untrained, ready-for-training word-vectors. You can ask for its length (len(model.wv)) or examine the discovered list of active words (model.wv.index_to_key).
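    As a quick sketch of such checks (reusing the model and train_data from your code, with Gensim 4.x attribute names):

        model.build_vocab(train_data)

        # number of distinct words that survived min_count
        print(len(model.wv))

        # a sample of the retained words, most-frequent first
        print(model.wv.index_to_key[:10])

        # number of documents scanned during the vocabulary survey
        print(model.corpus_count)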

    Other comments: