Tags: python, nlp, word2vec, doc2vec

Are the document vectors used in doc2vec one-hot?


I understand conceptually how word2vec and doc2vec work, but am struggling with the nuts and bolts of how the numbers in the vectors get processed algorithmically.

If the vectors for three context words are: [1000], [0100], [0010]

and the vector for the target word is [0001], does the algorithm perform one backward pass for each input/target output pair, like this:

[1000]-->[0001]
[0100]-->[0001]
[0010]-->[0001]

or are the input (context) vectors added together, like this:

[1110]-->[0001]

or is some other process used?

Additionally, do the document vectors used in doc2vec take the one-hot form of the word vectors, or are documents tagged with individual numbers on a continuous scale, like 1, 2, 3, etc.?

I get that the document tags are included as input nodes during the training process, but how are they used in the test phase? When entering the context word vectors to try to predict the target word (or vice versa) during testing, shouldn't an input for some document ID be required as well?


Solution

  • No. The vectors created by Word2Vec or the 'Paragraph Vectors' form of Doc2Vec are 'dense embeddings': continuous real-valued coordinates spread across a relatively small number of dimensions, rather than 0/1 coordinates in a very large number of dimensions.

    It's possible to think of parts of the training as having a 'one-hot' encoding of the presence or absence of a word, or of a particular document-ID, with these raw 'one-hot' layers then activating a 'projection' layer that maps/averages the one-hots into a dense space. But the implementations I'm familiar with, such as the original Google word2vec.c or Python gensim, never actually materialize giant vocabulary-sized one-hot vectors.

    Rather, they use words/document-tags as lookup keys to select the right dense vectors for later operations. These looked-up dense vectors start at random low-magnitude coordinates, but then get continually adjusted by training until they reach the useful distance/direction arrangements for which people use Word2Vec/PV-Doc2Vec.

    So in skip-gram, the word 'apple' will pull up a vector, initially random, and that context vector is forward-propagated to see how well it predicts a specific in-window target word. Then, nudges to all values (including to the 'apple' vector's individual dimensions) are applied to make the prediction slightly better.

    In the PV-DBOW mode of Doc2Vec, the document ID 'doc#123' (or perhaps just the int slot 123, used only as a lookup index rather than as a value on a continuous scale) will pull up a candidate vector for that document, initially random, which is then evaluated and nudged according to how well it predicts the words in that document.

    Word2Vec CBOW and Doc2Vec PV-DM involve some extra averaging of multiple candidate vectors together before forward-propagation, and then a fractional distribution of the nudges back across all the vectors that were combined to make the context. It's still the same general approach, though, and it works with dense continuous vectors (often of 100-1000 dimensions) rather than one-hot vectors (of dimensionality as large as the whole vocabulary, or the whole document set).
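
To make the 'lookup key' point in the answer concrete, here is a minimal NumPy sketch. It is not code from word2vec.c or gensim; the toy vocabulary, the names `input_vectors` and `word_to_index`, and the initialization range are assumptions chosen for illustration. It shows that multiplying a one-hot vector by the weight matrix gives the same result as simply reading one row, which is why the one-hot vector never needs to be built.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary (assumed for illustration only).
vocab = ["apple", "banana", "cherry", "date"]
word_to_index = {word: i for i, word in enumerate(vocab)}
vector_size = 3

# One dense vector per word, starting at small random values.
input_vectors = (rng.random((len(vocab), vector_size)) - 0.5) / vector_size

# Conceptual view: a one-hot vector multiplied by the weight matrix.
one_hot = np.zeros(len(vocab))
one_hot[word_to_index["banana"]] = 1.0
via_one_hot = one_hot @ input_vectors

# Practical view: use the word as a key and read its row directly.
via_lookup = input_vectors[word_to_index["banana"]]

same = np.allclose(via_one_hot, via_lookup)
print("one-hot product equals row lookup:", same)
```

A table of document vectors can be indexed the same way, with a document tag (or its integer slot) taking the place of the word.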
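
The skip-gram step described in the answer (look up one vector, forward-propagate it, nudge everything a little) might be sketched as follows. This is a simplified illustration, not the code of any library: it assumes a negative-sampling style objective, zero-initialized output weights, and hypothetical names such as `train_pair`. The key point for the question is that each (input, target) pair gets its own forward and backward pass.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, vector_size, learning_rate = 6, 4, 0.1

# Dense input vectors: small random start. Output weights: zeros (assumed).
input_vectors = (rng.random((vocab_size, vector_size)) - 0.5) / vector_size
output_weights = np.zeros((vocab_size, vector_size))


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train_pair(input_index, target_index, negative_indices):
    """One forward/backward pass for a single (input, target) pair."""
    hidden = input_vectors[input_index].copy()   # the looked-up dense vector
    hidden_grad = np.zeros(vector_size)
    # Label 1 for the true target, 0 for each sampled negative word.
    samples = [(target_index, 1.0)] + [(n, 0.0) for n in negative_indices]
    for out_index, label in samples:
        prediction = sigmoid(hidden @ output_weights[out_index])
        error = (label - prediction) * learning_rate
        hidden_grad += error * output_weights[out_index]  # nudge for the input
        output_weights[out_index] += error * hidden       # nudge for the output
    input_vectors[input_index] += hidden_grad


apple, target, negatives = 0, 3, [1, 4]   # hypothetical word indices
start_vector = input_vectors[apple].copy()
before = sigmoid(input_vectors[apple] @ output_weights[target])
for _ in range(20):
    train_pair(apple, target, negatives)
after = sigmoid(input_vectors[apple] @ output_weights[target])
print("predicted probability of target:", before, "->", after)
```

For PV-DBOW the same step applies, with the looked-up row coming from a table of document vectors instead of word vectors.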
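
The averaging used by CBOW and PV-DM can be sketched in the same spirit. Again this is only an illustration under stated assumptions: a negative-sampling style objective, zero-initialized output weights, hypothetical names (`train_averaged`, `doc_vectors`), and an equal split of the nudge across the contributing vectors. Real implementations differ in exactly how they scale that correction.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab_size, doc_count, vector_size, learning_rate = 5, 2, 4, 0.1

word_vectors = (rng.random((vocab_size, vector_size)) - 0.5) / vector_size
doc_vectors = (rng.random((doc_count, vector_size)) - 0.5) / vector_size
output_weights = np.zeros((vocab_size, vector_size))


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train_averaged(context_indices, target_index, negative_indices, doc_index=None):
    """One pass where several looked-up vectors are averaged into one context."""
    rows = [(word_vectors, i) for i in context_indices]
    if doc_index is not None:
        rows.append((doc_vectors, doc_index))   # PV-DM: doc vector joins the mix
    hidden = np.mean([table[i] for table, i in rows], axis=0)
    hidden_grad = np.zeros(vector_size)
    samples = [(target_index, 1.0)] + [(n, 0.0) for n in negative_indices]
    for out_index, label in samples:
        prediction = sigmoid(hidden @ output_weights[out_index])
        error = (label - prediction) * learning_rate
        hidden_grad += error * output_weights[out_index]
        output_weights[out_index] += error * hidden
    share = hidden_grad / len(rows)             # each contributor gets a fraction
    for table, i in rows:
        table[i] += share


context, target, negatives, doc = [0, 1, 2], 3, [4], 0   # hypothetical indices
start_words, start_docs = word_vectors.copy(), doc_vectors.copy()
for _ in range(30):
    train_averaged(context, target, negatives, doc_index=doc)

word_deltas = word_vectors[context] - start_words[context]
doc_delta = doc_vectors[doc] - start_docs[doc]
hidden_now = np.mean(np.vstack([word_vectors[context], doc_vectors[[doc]]]), axis=0)
after = sigmoid(hidden_now @ output_weights[target])
print("probability of target after training:", after)
```

In terms of the question's example, this corresponds to a single pass per target word with the three context vectors combined, except that what gets combined are the dense looked-up vectors, not the one-hot codes.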