I am looking at the Keras GloVe word embedding example, and it is not clear why the first row of the embedding matrix is populated with zeros.
First, the embedding index is created where words are associated with arrays.
import os
import numpy as np

# Build a lookup from each word to its pretrained GloVe vector.
embeddings_index = {}
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt')) as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, 'f', sep=' ')
        embeddings_index[word] = coefs
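As a quick sanity check on what the index holds (the looked-up word here is an arbitrary example):

vec = embeddings_index.get('the')
print(vec.shape)              # (100,) -- one 100-dimensional GloVe vector per word
print(len(embeddings_index))  # 400000 -- the vocabulary of the glove.6B files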
Then the embedding matrix is created by looking at words from the index created by tokenizer.
# prepare embedding matrix
num_words = min(MAX_NUM_WORDS, len(word_index) + 1)
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i >= MAX_NUM_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
Since the loop starts with i=1, the first row will contain only zeros (or random numbers, had the matrix been initialized differently). Is there a reason for skipping the first row?
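A minimal check of that claim, assuming the Tokenizer from tensorflow.keras (the sample sentence is arbitrary):

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(['the cat sat on the mat'])
print(tokenizer.word_index)
# {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5} -- indices start at 1, never 0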
It all starts from the fact that the Tokenizer's authors reserved the index 0. Part of it may be convention (some other languages index from 1), but most importantly Keras uses 0 as the padding value, so no real word ever receives index 0.
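A short illustration of how that reserved index is used, assuming tensorflow.keras's pad_sequences (the token ids are made up):

from tensorflow.keras.preprocessing.sequence import pad_sequences

seqs = [[2, 5, 1], [7, 1]]
print(pad_sequences(seqs, maxlen=4))
# [[0 2 5 1]
#  [0 0 7 1]] -- padded positions are filled with 0,
#               so 0 must not collide with a real word index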
However, the example uses NumPy with the straightforward indexing

embedding_matrix[i] = embedding_vector

so the row at index [0] simply stays full of zeros. There is also no case where, as written in the question, it would hold "random numbers if the matrix is initialized differently", because this array is explicitly initialized with np.zeros.
So the first row is never actually needed, but you can't delete it: the NumPy array's row indices would no longer align with the tokenizer's word indices.
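To see why the alignment matters, recall where the matrix ends up: the Keras example hands it to an Embedding layer, which looks up row i for token id i. A sketch along those lines, reusing num_words, EMBEDDING_DIM, and embedding_matrix from the snippet above:

from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            trainable=False)
# The layer maps token id i directly to embedding_matrix[i]; deleting row 0
# would shift every row up by one and pair each word with the wrong vector.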