gensim doc2vec

What is the purpose of Tags in Doc2Vec TaggedDocument?


Is it to aid in classification tasks? The [docs][1] and tutorials don't explain this; they seem to assume a level of understanding that I don't have. These SO answers get near it but don't say it explicitly:


Solution

  • The 'tag' is just the key with which to look-up the learned document vector, after training is done.

    The original 'Paragraph Vectors' research papers, on which Gensim's Doc2Vec is based, tended to just assume each document had one unique ID – perhaps, a string token just like any other word. (So, too, did a small patch to the original Google word2vec.c that was once shared, long ago, as a limited example of one mode of 'paragraph vectors'.)

    In those original formulations, each document had just one unique ID – the lookup key for its vector.

    However, it was a fairly obvious/straightforward extension to allow these associated vectors to potentially map to other known, shared labels across many documents. (That is, not a unique vector per document, but a unique vector per label, which might appear on multiple texts.) And further, multiple such range-of-text vectors might be relevant to a single text that's known to deserve more than one label. (The second sketch at the end of this answer shows both variations.)

    So the word 'tag' was used in the Gensim implementation to convey that this is an association more general than either a unique ID or a known label, though it might in some cases be either.

    If you're just starting out, or trying to match early papers, just consider the 'tag' a single unique ID per document. Give every independent document its own unique name – whether it's something natural from your data source (like a unique article title or primary key), or a mere serial number, from '0' to the count of docs in your data. (The first sketch at the end of this answer shows that pattern.)

    Only if you're trying other expert/experimental approaches, after understanding the basic approach, would you want to either repeat a 'tag' across multiple documents, or use more than one 'tag' per document. Neither of those approaches is necessary, or typical, in the initial application of Doc2Vec.

    (And if you start to re-use known tags in training, Doc2Vec is no longer a strictly 'unsupervised' machine-learning technique, but starts to behave more like a 'supervised' or 'semi-supervised' technique, where you're nudging the algorithm towards desired answers. That's sometimes useful, and appropriate, but starts to complicate estimates of how well your steps are working: you then have to use things like held-back test/validation data to get trustworthy estimates of your system's success.)
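First sketch – the plain starting-point usage, with one unique ID tag per document. This assumes gensim 4.x, where trained document vectors are reached via `model.dv` (older 3.x releases used `model.docvecs`); the toy corpus and tiny training parameters are purely illustrative:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each document gets exactly one unique string tag –
# here just a serial number, '0' up to the count of docs.
raw_docs = [
    "the cat sat on the mat",
    "dogs chase cats in the yard",
    "stock prices fell sharply today",
]
corpus = [TaggedDocument(words=text.split(), tags=[str(i)])
          for i, text in enumerate(raw_docs)]

# Tiny parameters purely so the example runs quickly.
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# After training, the tag is simply the key for looking up the learned
# document vector (model.dv in gensim 4.x, model.docvecs in older versions).
vector_for_doc_0 = model.dv["0"]
print(model.dv.most_similar("0"))  # other tags ranked by vector similarity
```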
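Second sketch – the more experimental variation: re-using a shared label across documents, and giving a document more than one tag. The category names here ('pets', 'finance') are made up for illustration:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each document keeps its own unique ID *and* also carries a shared
# category label, so the model learns one vector per ID and one per label.
corpus = [
    TaggedDocument("the cat sat on the mat".split(), tags=["doc_0", "pets"]),
    TaggedDocument("dogs chase cats in the yard".split(), tags=["doc_1", "pets"]),
    TaggedDocument("stock prices fell sharply today".split(), tags=["doc_2", "finance"]),
]

model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# Both kinds of vectors are now available by key.
per_document = model.dv["doc_1"]   # unique per-document vector
per_label = model.dv["pets"]       # shared vector trained on both 'pets' docs
print(model.dv.most_similar("pets"))
```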