Is it to aid in classification tasks? The [docs][1] and tutorials don't explain this; they seem to assume a level of understanding that I don't have. These SO answers come close, but don't say it explicitly:
The 'tag' is just the key with which to look up the learned document vector, after training is done.
The original 'Paragraph Vectors' research papers, on which Gensim's Doc2Vec is based, tended to assume each document had one unique ID – perhaps a string token just like any other word. (So, too, did a small patch to the original Google word2vec.c that was once shared, long ago, as a limited example of one mode of 'paragraph vectors'.) In those original formulations, documents had just one unique ID – the lookup key for their vector.
However, it was a fairly obvious/straightforward extension to allow these associated vectors to map to other known labels shared across many documents. (That is, not a unique vector per document, but a unique vector per label, which might appear on multiple texts.) And further, to allow multiple such vectors to be relevant to a single text that's known to deserve more than one label.
So the word 'tag' was used in the Gensim implementation to convey that this is an association more general than either a unique ID or a known label, though in some cases it might be either.
If you're just starting out, or trying to match the early papers, just consider the 'tag' a single unique ID per document. Give every independent document its own unique name – whether it's something natural from your data source (like a unique article title or primary key), or a mere serial number, from '0' up to the count of docs in your data.
Only if you're trying other expert/experimental approaches, after understanding the basic approach, would you want to either repeat a 'tag' across multiple documents, or use more than one 'tag' per document. Neither of those approaches is necessary, or typical, in an initial application of Doc2Vec.
(And if you start to re-use known tags in training, Doc2Vec is no longer a strictly 'unsupervised' machine-learning technique, but starts to behave more like a 'supervised' or 'semi-supervised' technique, where you're nudging the algorithm towards desired answers. That's sometimes useful, and appropriate, but it starts to complicate estimates of how well your steps are working: you then have to use things like held-back test/validation data to get trustworthy estimates of your system's success.)