machine-learningnlpword2vecword-embeddingwmd

Word Mover's distance calculation between word pairs of two documents


According to WMD paper, travel cost or Euclidean distance between word pairs is calculated the way shown in the figure below.

enter image description here

Is this distance calculated in pair wise in a specific order? Such that the first, second and so on from each document as shown in the figure Or Obama's distance is calculated from all four words in D0 and then the minimum of these four is shown in the figure only.

Can someone explain how this works?

Also, why is then all three words in D3 compared with President in D0?


Solution

  • The calculation of WMD requires finding the cheapest shifting of word-weight-configuration in a first text, into the word-weight-configuration of the second text.

    The word-order is irrelevant. Any word's mass in one text could be shifted to the position of any word in the other text. The optimization process which finds the best shifts thus will consider many possible pairings. After it finds the best, the final single WMD number is the total travel distance in that best solution.

    Because of differences in word count, words may not be shifted one-to-one, but as proportion of the full text's mass. So consider the bottom example in the graphic you included: the top text D0 has 4 significant words, and the bottom text D3 has just 3 significant words. So each of the top text's 4 words can be thought of as having 0.25 mass, and each of the bottom text's words can be thought of as having 0.33 mass.

    'Obama' might thus map very closely to 'President' - but even moving 0.25 of 'Obama' mass to 'President' leaves 0.08 mass left over that must travel to another D0 word. Similarly with 'Illinois' and 'Chicago' – even if a 0.25 of 'Illinois' mass is moved to 'Chicago', 0.08 is left-over that must travel to another D0 word. The exact mix of paths and proportions chosen will be the best possible, but will typically involve some words being fractionally shifted across multiple other words.