I am training a binary text classification model using BERT as follows:
def create_model():
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
preprocessed_text = bert_preprocess(text_input)
outputs = bert_encoder(preprocessed_text)
# Neural network layers
l1 = tf.keras.layers.Dropout(0.1, name="dropout")(outputs['pooled_output'])
l2 = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(l1)
# Use inputs and outputs to construct a final model
model = tf.keras.Model(inputs=[text_input], outputs=[l2])
return model
This code is borrowed from the example on tfhub: https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4.
I want to extract feature embeddings from the penultimate layer and use them for comparison, clustering, visualization, etc between examples. Should this be done before dropout (l1 in the model above) or after dropout (l2 in the model above)?
I am trying to figure out whether this choice makes a significant difference, or is it fine either way? For example, if I extract feature embeddings after dropout and compute feature similarities between two examples, this might be affected by which nodes are randomly set to 0 (but perhaps this is okay).
In order to answer your question let's recall how a Dropout layer works:
The Dropout layer is usually used as a means to mitigate overfitting. Suppose two layers, A and B, are connected through a Dropout layer. Then during the training phase, neurons in layer A are being randomly dropped. That prevents layer B from becoming too dependent upon specific neurons in layer A, as these neurons are not always available. Therefore, layer B has to take into consideration the overall signal coming from layer A, and (hopefully) cannot cling to some noise which is specific to the training set.
An important point to note is that the Dropout mechanism is activated only during the training phase. While predicting, Dropout does nothing.
If I understand you correctly, you want to know whether to take the features before or after the Dropout (note that in your network l1
denotes the features after Dropout has been applied). If so, I would take the features before Dropout, because technically it does not really matter (Dropout is inactive during prediction) and it is more reasonable to do so (Dropout is not meaningful without a following layer).