I am trying to understand Word2Vec. For a word input of 5x1 (one-hot encoding) and a hidden layer of 3 units, I have come across the following information from well-known sources. The first (monochrome) image says the first column becomes the embedding vector when multiplying a 3x5 matrix by the 5x1 one-hot vector, while in the 2nd image, the 4th row is taken against a 1x5 one-hot encoding. This is confusing. I can't tell whether the embedding lookup picks a row or a column. Please help.
There are a lot of custom intros to the word2vec algorithm online that, in my opinion, are quite unhelpful in what they choose to highlight.
So, if you're struggling with a particular one, I'd suggest moving on to another. And if you have to go someplace else (like StackOverflow) to get an external writeup explained, try to provide a link to the full original source as context for understanding what that particular author has adopted as their mental model.
Further, if your true interest is in understanding an actual word2vec implementation, studying working source code may be a better path than more abstract write-ups for seeing what actually happens.
Although abstractly, you can think of a word-vector lookup as being the multiplication of a one-hot vector times a (count-of-vocabulary x count-of-dimensions) matrix, I've not seen popular implementations (like Google's original word2vec.c release, or the Python Gensim library) do exactly that. So learning that form can mislead when later using, reviewing the source code of, or implementing real code.
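For example, here's a minimal sketch (using numpy and made-up numbers, not anything from the real libraries) of what that abstract one-hot-times-matrix description looks like:

```python
import numpy as np

# Made-up (5-word-vocabulary x 3-dimensions) weight matrix, purely for illustration
W = np.arange(15, dtype=float).reshape(5, 3)

one_hot = np.zeros(5)
one_hot[3] = 1.0        # the "4th word" as a one-hot vector

vec = one_hot @ W       # (1x5) times (5x3) -> that word's 3-dimensional vector
print(vec)              # identical to the row W[3]
```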
Instead, implementations tend to use the word-token as a key to look up a row-number inside some sort of dict/hashtable. (No one-hot vector is ever created – except abstractly, in the sense that a simple int can be thought of as representing the one-hot vector with a single one at that int's index.)
Then, they use that row-number to access the word's vector, from a matrix that's better considered the "input weights" leading to a hidden layer, rather than any "hidden layer" itself. (The "hidden layer" activations, at least in 1:1 modes like skip-gram, are then just that vector itself.)
That is: despite the abstract description, in implementations no multiplication occurs. An index-lookup occurs, then a simple row-access-by-that-index, and then you've got the word's vector. (And yes, it's the same result as if there'd been a one-hot multiplication – at least in the simple skip-gram mode, where the 'input' to the network is a single context word.)
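A rough sketch of that implementation-style path, with illustrative names (word2index and W here are placeholders, not Gensim's or word2vec.c's actual internals):

```python
import numpy as np

word2index = {"the": 0, "quick": 1, "brown": 2, "fox": 3, "jumps": 4}  # dict/hashtable
W = np.random.rand(5, 3)        # "input weights": one row per vocabulary word

idx = word2index["fox"]         # key -> row-number lookup
vec = W[idx]                    # simple row-access by that index

# Same result as the abstract one-hot multiplication, though no multiply happened:
one_hot = np.zeros(5)
one_hot[idx] = 1.0
assert np.allclose(vec, one_hot @ W)
```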
To try to map to that diagram's top (monochrome) half: you have a 5-word vocabulary, where each word has 3 dimensions. The columns of that w00 .. w24 table, with 0-based indexes {0,1,2,3,4}, are the individual word-vectors. (This varies from most implementations I know, where individual word-vectors are the rows of the model's matrix.)
So, per the top (monochrome) half, you get the 3-dimensional word-vector for the 1st of 5 words by pulling the column that's w[0][0] .. w[2][0]: rows 0-2, column 0.
In contrast, per the bottom (multicolor) half, you get the 3-dimensional word-vector for the 4th of 5 words by pulling the row that's cell[3][0] to cell[3][2]: row 3, columns 0-2.
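In code terms, the two conventions might be contrasted like this (toy numbers only, not either diagram's actual values):

```python
import numpy as np

# Top (monochrome) convention: a 3x5 matrix, word-vectors are the *columns*
W_top = np.arange(15, dtype=float).reshape(3, 5)
vec_word_0 = W_top[:, 0]     # rows 0-2, column 0 -> the 1st word's 3-d vector

# Bottom (multicolor) convention: a 5x3 matrix, word-vectors are the *rows*
W_bottom = W_top.T           # same numbers, transposed layout
vec_word_3 = W_bottom[3, :]  # row 3, columns 0-2 -> the 4th word's 3-d vector
```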
The bottom diagram better fits the implementations I know. There, learned word-vectors – both the in-progress vectors grabbed for adjustment during training, & what's accessed at the end as final word-vectors – are more often stored as the rows of an "input weights" matrix.
That matrix can be thought of as a mapping from one-hot vectors to hidden-layer activations - but really isn't a "hidden layer" itself. And further: in a mode like "CBOW with averaging", the actual "hidden layer" activations are an average of multiple rows' values.
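As a sketch of that averaging idea (again with made-up names and values, not any library's real API):

```python
import numpy as np

W_in = np.random.rand(5, 3)       # input weights: one row per vocabulary word
context_indices = [0, 2, 4]       # row numbers of the current context words

# In "CBOW with averaging", the hidden-layer activations are the mean of those rows
hidden = W_in[context_indices].mean(axis=0)
```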
Of your 2 diagrams, the bottom better represents usual implementations – though again, usually no actual "multiplication by a one-hot" occurs.
Hope this helps!