In the Hugging Face source code, pooled_output = outputs[1] is used:
outputs = self.bert(
    input_ids,
    attention_mask=attention_mask,
    token_type_ids=token_type_ids,
    position_ids=position_ids,
    head_mask=head_mask,
    inputs_embeds=inputs_embeds,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
pooled_output = outputs[1]
Shouldn't it be pooled_output = outputs[0]? (This answer mentioning BertPooler seems to be outdated.)
Based on this answer, it seems that the [CLS] token learns a sentence-level representation. I am confused as to why/how masked language modeling would lead to the start token learning a sentence-level representation. (I am thinking that BertForSequenceClassification freezes the BERT model and only trains the classification head, but maybe that's not the case.)
Would a sentence embedding be equivalent or even better than the [CLS] token embedding?
Would a sentence embedding be equivalent or even better than the [CLS] token embedding?
A sentence embedding is anything that represents the input sequence as a numerical vector. The question is whether this embedding is semantically meaningful (e.g., whether we can use it with similarity metrics). This is, for example, not the case for the pretrained BERT weights released by Google (refer to this answer for more information).
Is the [CLS] token a sentence embedding? Yes. Is some kind of pooling over the token embeddings a sentence embedding? Yes. Are they semantically meaningful with the BERT weights released by Google? No.
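To make this concrete, here is a minimal sketch (not from the original post; the model name and example sentences are arbitrary choices of mine) that extracts both kinds of sentence embedding from a plain pretrained BertModel, the [CLS] vector and a mean-pooled vector over the token states, and compares them with cosine similarity:

# Illustrative sketch: [CLS] embedding vs. mean-pooled sentence embedding
# using vanilla pretrained weights. Model name and sentences are assumptions.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = ["The cat sat on the mat.", "A cat is sitting on a mat."]
encoded = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoded)

last_hidden = outputs.last_hidden_state   # (batch, seq_len, hidden)
cls_embedding = last_hidden[:, 0, :]      # the [CLS] token vector

# Mean pooling over real tokens only (mask out padding positions).
mask = encoded["attention_mask"].unsqueeze(-1).float()
mean_embedding = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Both are "sentence embeddings"; whether the similarity scores are
# semantically meaningful with the plain pretrained weights is the
# separate question discussed above.
cos = torch.nn.functional.cosine_similarity
print("CLS similarity :", cos(cls_embedding[0:1], cls_embedding[1:2]).item())
print("Mean similarity:", cos(mean_embedding[0:1], mean_embedding[1:2]).item())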
Shouldn't it be pooled_output = outputs[0]?
No, because when you check the code, you will see that the first element of the tuple is the last_hidden_state:
sequence_output = encoder_outputs[0]
pooled_output = self.pooler(sequence_output) if self.pooler is not None else None

if not return_dict:
    return (sequence_output, pooled_output) + encoder_outputs[1:]
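If you want to verify the ordering yourself, a quick illustrative check (the checkpoint name is just an example) with return_dict=False shows that index 0 holds one vector per token while index 1 holds a single pooled vector per sequence:

# Illustrative check that outputs[0] is the per-token last_hidden_state
# and outputs[1] is the pooled [CLS] output.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

encoded = tokenizer("A short example sentence.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**encoded, return_dict=False)  # force the tuple output

sequence_output, pooled_output = outputs[0], outputs[1]
print(sequence_output.shape)  # (1, seq_len, 768) -> one vector per token
print(pooled_output.shape)    # (1, 768)          -> one vector per sequence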
I am confused as to why/how masked language modeling would lead to the start token learning a sentence level representation.
Because it is included in every training sequence, and the [CLS] token "absorbs" information from the other tokens. You can also see this in the attention mechanism (compare the paper Revealing the Dark Secrets of BERT). As mentioned above, the question is whether these representations are semantically meaningful without any further finetuning. No (compare this StackOverflow answer).