Does `sample=0` in Gensim word2vec mean that no downsampling is used during training? The documentation says only that the "useful range is (0, 1e-5)". However, setting the threshold to 0 would make P(wi) equal to 1, meaning that no word would ever be discarded. Am I understanding this right or not?
I'm working on a relatively small dataset of 7597 Facebook posts (18945 words), and my embeddings perform far better with `sample=0` than with anything else within the recommended range. Is there any particular reason? The text size?
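For reference, a minimal sketch of the kind of training call I mean (gensim 4.x parameter names; `posts` stands in for my tokenized corpus):

```python
from gensim.models import Word2Vec

# posts stands in for my tokenized corpus, one token list per Facebook post
posts = [["first", "post", "tokens"], ["another", "short", "post"]]

# no down-sampling vs. a value in the documented range
model_no_sampling = Word2Vec(posts, vector_size=100, window=5, min_count=1, sample=0)
model_sampled = Word2Vec(posts, vector_size=100, window=5, min_count=1, sample=1e-5)
```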
That seems an incredibly tiny dataset for Word2Vec training. (Is that 18945 unique words, or 18945 words total, which would be barely more than 2 words per post?)

Sampling is most useful on larger datasets, where there are so many examples of common words that extra training examples of them add little, while stealing training time from, and overweighting those words' examples compared to, other less-frequent words.
Yes, `sample=0` means no down-sampling.
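To see why aggressive sampling hurts on a corpus this small, here is a rough sketch of the keep-probability formula from the original `word2vec.c` (gensim's implementation follows the same scheme; exact internals may vary slightly across versions). With only ~19k total words, `sample=1e-5` discards the vast majority of even modestly frequent words' occurrences:

```python
import math

def keep_probability(count, total_words, sample):
    """word2vec.c-style down-sampling keep-probability for a word
    occurring `count` times in a corpus of `total_words` tokens.
    With sample == 0 the implementation keeps every occurrence."""
    if sample == 0:
        return 1.0
    threshold = sample * total_words
    return min(1.0, (math.sqrt(count / threshold) + 1) * threshold / count)

total = 18945  # the asker's corpus size
for count in (1000, 100, 10):
    p = keep_probability(count, total, sample=1e-5)
    print(f"count={count}: keep ~{p:.1%} -> ~{count * p:.0f} occurrences survive")
```

On a billion-word corpus, throwing away 95%+ of a common word's occurrences still leaves plenty of examples; on 18945 words it leaves almost nothing to train on, which is consistent with `sample=0` working best here.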