nlp · gensim · word2vec · word-embedding · glove

How to compare cosine similarities across three pretrained models?


I have two corpora - one with speeches by women leaders and the other with speeches by men leaders. I would like to test the hypothesis that the cosine similarity between two words in one corpus is significantly different from the cosine similarity between the same two words in the other corpus. Is such a t-test (or equivalent) logical and possible?

Further, if the cosine similarities are different across the two corpora, how could I examine whether the cosine similarity between the same two words in a third corpus is more similar to that of the first or the second corpus?


Solution

  • It's certainly possible. Whether it's meaningful, given a certain amount of data, is harder to answer.

    Note that in separate training sessions, a given word A won't necessarily wind up in the same coordinates, due to inherent randomness used by the algorithm. That's even the case when training on the exact same data.

    It's just the case that, in general, the distances/directions to other words B, C, etc. should be of similar overall usefulness, when there's sufficient data/training and well-chosen parameters. So A, B, C, etc. may be in different places, with slightly-different distances/directions – but the relative relationships are still similar, in terms of neighborhoods-of-words, or of the (A-B) direction still being predictive of certain human-perceptible meaning-differences when applied to other words C, etc.

    So, you should avoid making direct cosine-similarity comparisons between words from different training-runs or corpuses, but you may find meaning in differences of similarities (A-B vs A'-B'), or in top-N lists, or in relative rankings. (This could also be how to compare against a 3rd corpus: to what extent is there variance or correlation in certain pairwise similarities, or top-N lists, or ordinal ranks of relevant words in each other's 'most similar' results?)
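
    For example, here is a minimal sketch of such a relative comparison, assuming two gensim Word2Vec models have already been trained (one per corpus), that wv_women and wv_men are their .wv KeyedVectors, and that the anchor and probe words are hypothetical placeholders:

    ```python
    from scipy.stats import spearmanr

    anchor = 'leader'                                            # hypothetical word of interest
    probes = ['family', 'economy', 'war', 'children', 'power']   # hypothetical probe words

    # Only probes present in both vocabularies are comparable (the anchor is
    # assumed to be present in both).
    shared = [p for p in probes
              if p in wv_women.key_to_index and p in wv_men.key_to_index]

    # Cosine similarities of the anchor to each probe, within each model.
    prof_women = [wv_women.similarity(anchor, p) for p in shared]
    prof_men = [wv_men.similarity(anchor, p) for p in shared]

    # Rank-correlate the two similarity profiles: raw cosines aren't directly
    # comparable across separately-trained models, but relative orderings are.
    rho, p_value = spearmanr(prof_women, prof_men)
    print(f"Spearman rho={rho:.3f}, p={p_value:.3f}")
    ```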

    You might want to perform a sanity check on your measures, by seeing to what extent they imply meaningful differences in comparisons where they logically "shouldn't". For example, run multiple training sessions against the exact same corpus that's just been reshuffled, or against random subsets of the exact same corpus. (I'm not aware of anything as formal as a 't-test' for checking the significance of differences between word2vec models, but checking whether some difference is enough to distinguish a truly-different corpus from just a 1/Nth random subset of the same corpus, to a certain confidence level, might be a grounded way to assert meaningful differences.)
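
    As a rough illustration of that sanity check, here is a sketch assuming sentences is your full corpus as a list of token lists, and the word pair is a placeholder:

    ```python
    import random
    from gensim.models import Word2Vec

    def trained_similarity(sentences, w1, w2, seed):
        # Retrain from scratch on a reshuffled copy of the same data.
        shuffled = list(sentences)
        random.Random(seed).shuffle(shuffled)
        model = Word2Vec(shuffled, vector_size=100, min_count=5,
                         epochs=5, workers=1, seed=seed)
        return model.wv.similarity(w1, w2)

    # Spread of the same pairwise similarity across repeated runs on the same
    # (reshuffled) corpus: a baseline for run-to-run, "meaningless" variation.
    sims = [trained_similarity(sentences, 'leader', 'family', seed=s) for s in range(5)]
    print(sims)
    ```

    If the women-vs-men difference you observe is no larger than this run-to-run spread, it probably isn't meaningful at your corpus size.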

    To the extent such "oughtta be very similar" runs instead show end vector results that are tangibly different, it could be suggestive that your corpus is too small, or your parameters ill-chosen, to support stable and reliable comparisons.

    You'd also want to watch out for mismatches in training-corpus size. A corpus that's 10x as large means many more words would pass a fixed min_count threshold, and any chosen N epochs of training will involve 10x as many examples of common-words, and support stable results in a larger (vector-size) model - whereas the same model parameters with a smaller corpus would give more volatile results.

    Another technique you could consider would be combining corpuses into one training set, but munging the tokens of key words-of-interest to be different depending on the relevant speaker. For example, you'd replace the word 'family' with 'f-family' or 'm-family', depending on the gender of the speaker. (You might do this for every occurrence, or some fraction of the occurrences. You might also enter each speech into your corpus more than once, sometimes with the actual words and sometimes with some-or-all replaced with the context-labeled alternates.)

    In that case, you'd wind up with one final model, and all words/context-tokens in the 'same' coordinate space for direct comparison. But, the pseudowords 'f-family' and 'm-family' would have been more influenced by their context-specific usages - and thus their vectors might vary from each other, and from the original 'family' (if you've also retained unmunged instances of its use) in interestingly suggestive ways.
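
    Here is a minimal sketch of that combined-corpus approach, assuming women_speeches and men_speeches are lists of tokenized sentences and the set of words-of-interest is a hypothetical choice:

    ```python
    from gensim.models import Word2Vec

    WORDS_OF_INTEREST = {'family', 'power', 'economy'}   # hypothetical choices

    def munge(sentences, prefix):
        # Replace each word-of-interest with a speaker-labeled pseudoword,
        # e.g. 'family' -> 'f-family' in the women-leaders corpus.
        return [[f"{prefix}-{w}" if w in WORDS_OF_INTEREST else w for w in sent]
                for sent in sentences]

    combined = munge(women_speeches, 'f') + munge(men_speeches, 'm')
    model = Word2Vec(combined, vector_size=100, min_count=5, epochs=5)

    # Both pseudowords now share one coordinate space, so this cosine
    # similarity is directly comparable:
    print(model.wv.similarity('f-family', 'm-family'))
    ```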

    Also: if using the 'analogy-solving' methods of the original Google word2vec code release, or of other libraries that have followed its example (like gensim), note that they specifically won't return as an answer any of the words supplied as input. So when solving the gender-fraught analogy 'man' : 'doctor' :: 'woman' : _?_, via the call model.most_similar(positive=['doctor', 'woman'], negative=['man']), even if the underlying model still has 'doctor' as the closest word to the target coordinates, it is automatically skipped as one of the input words, yielding the second-closest word instead.

    Some early "bias-in-word-vectors" write-ups ignored this detail, and so this implementation artifact led them to imply larger biases, even where such biases were small-to-nonexistent. (You can supply raw vectors, instead of string tokens, to most_similar() - and then get full results, without any filtering of input tokens.)
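
    For instance, a small sketch of that raw-vector workaround, assuming wv is a gensim KeyedVectors instance whose vocabulary contains 'man', 'woman', and 'doctor':

    ```python
    # Standard call: the input words ('doctor', 'woman', 'man') can never
    # appear in the results, even if one is nearest the target coordinates.
    print(wv.most_similar(positive=['doctor', 'woman'], negative=['man'], topn=5))

    # Supplying the combined raw vector instead skips that input-word filtering
    # (unit-normed vectors, roughly as most_similar() uses internally).
    target = (wv.get_vector('doctor', norm=True)
              + wv.get_vector('woman', norm=True)
              - wv.get_vector('man', norm=True))
    print(wv.most_similar(positive=[target], topn=5))
    ```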