python · tensorflow · keras · fasttext · language-model

Keras model with fasttext word embedding


I am trying to build a language model in keras that predicts the last word of a sentence given all the previous words. I would like to embed my inputs using a fasttext embedding model that I trained beforehand.

I managed to preprocess my text data and embed it using fasttext. My training data consists of sentences of 40 tokens each. I created two numpy arrays, X and y, as inputs, with y being what I want to predict.

X is of shape (44317, 39, 300), with 44317 the number of example sentences, 39 the number of input tokens per sentence (the 40 tokens minus the one to predict), and 300 the dimension of the word embeddings.

y is of shape (44317, 300) and holds, for each example, the embedding of the last token of the sentence.
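
For reference, my embedding step looks roughly like this (simplified: the model path and the sentences list of tokenized 40-token sentences are placeholders):

import numpy as np
from gensim.models import FastText

# previously trained fasttext model ('fasttext.model' is a placeholder path)
ft = FastText.load('fasttext.model')

# sentences: a list of tokenized sentences, 40 tokens each
X = np.array([[ft.wv[token] for token in sentence[:-1]] for sentence in sentences])  # (44317, 39, 300)
y = np.array([ft.wv[sentence[-1]] for sentence in sentences])                        # (44317, 300)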

My code for the keras model goes as follows (inspired by this):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import InputLayer, LSTM, Dense
model = Sequential()  
model.add(InputLayer((None, 300)))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(300, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, batch_size=128, epochs=20)
model.save('model.h5')

However, the accuracy I get while training this model is extremely low (around 1.5%). I think I have misunderstood some component of the keras model, because if I don't embed my inputs and instead add an extra Embedding layer in place of the InputLayer, I get an accuracy of about 60 percent.

My main doubt is the value of "300" in my second Dense layer, as I read that this should correspond to the vocabulary size of my word embedding model (which is 48000). However, if I put anything other than 300 there I get a dimension error. So I understand that I'm doing something wrong, but I can't figure out how to fix it.

PS: I have also tried y = to_categorical(y, num_classes=vocab_size), with vocab_size the vocabulary size of my word embedding, and replacing 300 with this same value in the second Dense layer. However, it then tries to create an array of shape (13295100, 48120) instead of what I expect, (44317, 48120): since y already contains one 300-dimensional embedding per example, to_categorical treats each of the 44317 × 300 = 13295100 floats as a separate class index.


Solution

  • If you really want to use the word vectors from Fasttext, you will have to incorporate them into your model using a weight matrix and an Embedding layer. The goal of the Embedding layer is to map each integer sequence representing a sentence to its corresponding 300-dimensional vector representation:

    import gensim.downloader as api
    import numpy as np
    import tensorflow as tf
    
    def load_doc(filename):
        # read the whole training corpus into a single string
        with open(filename, 'r') as file:
            return file.read()
    
    fasttext = api.load("fasttext-wiki-news-subwords-300")
    embedding_dim = 300
    
    in_filename = 'data.txt'
    doc = load_doc(in_filename)
    lines = doc.split('\n')
    
    tokenizer = tf.keras.preprocessing.text.Tokenizer()
    tokenizer.fit_on_texts(lines)
    text_sequences = tokenizer.texts_to_sequences(lines)
    text_sequences = tf.keras.preprocessing.sequence.pad_sequences(text_sequences, padding='post')
    vocab_size = len(tokenizer.word_index) + 1
    
    text_sequences = np.array(text_sequences)
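    # split each padded sequence: X = all tokens but the last, y = the last token id,
    # which is then one-hot encoded over the vocabulary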
    X, y = text_sequences[:, :-1], text_sequences[:, -1]
    y = tf.keras.utils.to_categorical(y, num_classes=vocab_size)
    max_length = X.shape[1]
    
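    # build the embedding weight matrix: row i holds the Fasttext vector of the word
    # with index i; words missing from the pretrained vectors get a random vector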
    weight_matrix = np.zeros((vocab_size, embedding_dim))
    for word, i in tokenizer.word_index.items():
        try:
            embedding_vector = fasttext[word]
            weight_matrix[i] = embedding_vector
        except KeyError:
            weight_matrix[i] = np.random.uniform(-5, 5, embedding_dim)
    
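    # the Embedding layer maps each integer id to its 300-dimensional pretrained vector;
    # the final Dense(vocab_size) softmax yields a probability distribution over the vocabulary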
    sentence_input = tf.keras.layers.Input(shape=(max_length,))
    x = tf.keras.layers.Embedding(vocab_size, embedding_dim, weights=[weight_matrix],
                                  input_length=max_length)(sentence_input)
    
    x = tf.keras.layers.LSTM(100, return_sequences=True)(x)
    x = tf.keras.layers.LSTM(100)(x)
    x = tf.keras.layers.Dense(100, activation='relu')(x)
    output = tf.keras.layers.Dense(vocab_size, activation='softmax')(x)
    model = tf.keras.Model(sentence_input, output)
    
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(X, y, batch_size=5, epochs=20)                                 
    

    Note that I am using the dataset and preprocessing steps from the tutorial you linked.
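
    Once the model is trained, next-word prediction can be sketched like this (the seed string below is just a made-up example; it reuses the tokenizer, max_length, and model defined above):

    seed = 'this is a made up seed sentence'  # hypothetical seed text
    encoded = tokenizer.texts_to_sequences([seed])
    encoded = tf.keras.preprocessing.sequence.pad_sequences(encoded, maxlen=max_length, padding='post')
    probs = model.predict(encoded)[0]
    predicted_word = tokenizer.index_word[int(np.argmax(probs))]  # map the argmax id back to its word
    print(predicted_word)

    One caveat: recent Keras releases have removed the weights= and input_length= arguments of the Embedding layer; if you hit that, passing embeddings_initializer=tf.keras.initializers.Constant(weight_matrix) instead should have the same effect.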