tensorflow, keras, bert-language-model, huggingface-transformers, language-model

BERT with Padding and Masked Token Prediction


I am playing around with a pretrained BERT model (bert-large-cased-whole-word-masking) using Hugging Face Transformers. I first used this piece of code:

from transformers import TFBertLMHeadModel, BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained("bert-large-cased-whole-word-masking")
m = TFBertLMHeadModel.from_pretrained("bert-large-cased-whole-word-masking")
logits = m(tokenizer("hello world [MASK] like it", return_tensors="tf")["input_ids"]).logits

I then used argmax to get the most probable token IDs after applying softmax. Things worked fine up to this point.

When I used padding with max_length = 100, the model started making false predictions and all the predicted tokens were the same, i.e. token ID 119.

Code I used for argmax:

tf.argmax(tf.keras.activations.softmax(m(tokenizer("hello world [MASK] like it", return_tensors="tf", max_length=100, padding="max_length")["input_ids"]).logits)[0], axis=-1)

Output before using padding:

<tf.Tensor: shape=(7,), dtype=int64, numpy=array([ 9800, 19082,  1362,   146,  1176,  1122,   119])>

Output after using padding with max_length of 100:

<tf.Tensor: shape=(100,), dtype=int64, numpy=
array([119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
       119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
       119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
       119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
       119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
       119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
       119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,
       119, 119, 119, 119, 119, 119, 119, 119, 119])>
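
For reference, decoding the repeated ID shows which token the model keeps predicting (with the cased tokenizer, ID 119 should be the period token "."):

print(tokenizer.convert_ids_to_tokens([119]))  # likely ['.'] for the cased vocabulary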

I wonder if this problem will persist even when training a new model, since it is mandatory to set the input shape for training. I have padded and tokenized the data, but now I want to know whether this problem continues there too.


Solution

  • As already mentioned in the comments, you forgot to pass the attention_mask to BERT, so it treated the added padding tokens like ordinary tokens.

    You also asked in the comments how you can get rid of the padding-token predictions. There are several ways to do this, depending on your actual task. One of them is to remove them with tf.boolean_mask and the attention_mask, as shown below:

    import tensorflow as tf
    from transformers import TFBertLMHeadModel, BertTokenizerFast

    ckpt = "bert-large-cased-whole-word-masking"

    t = BertTokenizerFast.from_pretrained(ckpt)
    m = TFBertLMHeadModel.from_pretrained(ckpt)

    # Encode once without padding and once padded to max_length=100
    e = t("hello world [MASK] like it", return_tensors="tf")
    e_padded = t("hello world [MASK] like it", return_tensors="tf", padding="max_length", max_length=100)

    def prediction(encoding):
      # Unpacking the encoding passes input_ids AND attention_mask to the model
      logits = m(**encoding).logits
      token_mapping = tf.argmax(tf.keras.activations.softmax(logits), axis=-1)
      # Drop the predictions that belong to padding positions
      return tf.boolean_mask(token_mapping, encoding["attention_mask"])

    token_predictions = prediction(e)
    token_predictions_padded = prediction(e_padded)

    print(token_predictions)
    print(token_predictions_padded)
    

    Output:

    tf.Tensor([ 9800 19082  1362   146  1176  1122   119], shape=(7,), dtype=int64)
    tf.Tensor([ 9800 19082  1362   146  1176  1122   119], shape=(7,), dtype=int64)
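
  • Regarding your question about training: the same principle applies. As long as the attention_mask returned by the tokenizer is passed to the model together with the input_ids, the padding positions are masked out of the attention computation, so padding to a fixed length should not distort the predictions for the real tokens. A minimal sketch (the batch below is made up for illustration and is not a full training loop):

    # Padded batch with a fixed length of 100; the tokenizer also returns the attention_mask
    train_enc = t(
        ["hello world [MASK] like it", "another short [MASK] here"],
        return_tensors="tf", padding="max_length", max_length=100,
    )
    # Unpacking the encoding forwards input_ids AND attention_mask,
    # so the padded positions cannot influence the real tokens
    out = m(**train_enc)
    print(out.logits.shape)  # (2, 100, vocab_size)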