pytorch · sentiment-analysis · similarity · bert-language-model

How can one obtain the "correct" embedding layer in BERT?


I want to utilize BERT to assess the similarity between two pieces of text:

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-classifier")

def calc_similarity(s1, s2):
    # Encode the two texts as a batch of two separate sequences
    inputs = tokenizer([s1, s2], return_tensors='pt', padding=True, truncation=True)

    with torch.no_grad():
        outputs = model(**inputs)
        # [CLS] embedding of each sequence: shape (2, 768)
        embeddings = outputs.last_hidden_state[:, 0, :].cpu().numpy()

    # Cosine similarity between the two sentence vectors
    cosine_similarity = F.cosine_similarity(
        torch.from_numpy(embeddings[0]), torch.from_numpy(embeddings[1]), dim=0
    )
    return cosine_similarity.item()

The model used here ("bert-classifier") is a BERT sentiment classifier, i.e. a model fine-tuned on top of the BERT architecture.

My inquiry primarily revolves around this line of code:

embeddings = outputs.last_hidden_state[:, 0, :].cpu().numpy()

I have seen at least three different implementations of this line. Besides the version above, which takes the first row (the [CLS] token embedding), there are two other variations:

embeddings = outputs.last_hidden_state.mean(axis=1).cpu().numpy()

and

embeddings = model.pooler(outputs.last_hidden_state).cpu().numpy()

In fact, for each input sequence outputs.last_hidden_state yields a seq_len*768 matrix (9*768 in my test), and each of the three methods above reduces it to a single 1*768 vector, which is then the basis for the similarity calculation. From my perspective, the first approach is not appropriate in the semantic space defined by the classification task, since the objective is not to predict the next word. What puzzles me is the choice between the second and third methods: should I take a simple average over the tokens, or use the model's own pooling layer?
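
For reference, here is a minimal sketch of what each variant produces, assuming the tokenizer and model loaded above, an example sentence of my own, and that the checkpoint still exposes BERT's standard pooling layer:

import torch

# Assumes `tokenizer` and `model` are the objects loaded above;
# the example sentence is only for illustration.
inputs = tokenizer("今天天气很好", return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

hidden = outputs.last_hidden_state          # (1, seq_len, 768)

cls_vec    = hidden[:, 0, :]                # variant 1: [CLS] token, (1, 768)
mean_vec   = hidden.mean(dim=1)             # variant 2: mean over all tokens, (1, 768)
pooled_vec = model.pooler(hidden)           # variant 3: BERT pooler (dense + tanh on [CLS]), (1, 768)

print(cls_vec.shape, mean_vec.shape, pooled_vec.shape)

(When the pooling layer is present, its result is also available directly as outputs.pooler_output.)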

Any assistance would be greatly appreciated!


Solution

  • The 1st approach is not a good choice: using the [CLS] token embedding directly may not work well if BERT was fine-tuned for a task other than similarity matching, because that vector is then specialised for the fine-tuning objective rather than for general sentence similarity.

    Taking the average of the token embeddings (mean pooling), or pooling by passing the hidden states through another dense layer (e.g. the model's own pooler), will work better.
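
As a concrete illustration of the mean-pooling suggestion, here is a minimal sketch assuming the tokenizer and model from the question; the helper name mean_pooled_similarity is only illustrative. It also masks out padding positions, which a plain .mean(axis=1) would average in:

import torch
import torch.nn.functional as F

# Hypothetical helper, assuming `tokenizer` and `model` from the question.
def mean_pooled_similarity(s1, s2):
    # Encode both sentences as a batch of two sequences
    inputs = tokenizer([s1, s2], return_tensors='pt', padding=True, truncation=True)

    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state           # (2, seq_len, 768)

    # Zero out padding positions before averaging
    mask = inputs['attention_mask'].unsqueeze(-1).float()    # (2, seq_len, 1)
    embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)  # (2, 768)

    return F.cosine_similarity(embeddings[0], embeddings[1], dim=0).item()

To use the model's own pooling layer instead, replace the masked mean with model.pooler(hidden), which is what the checkpoint exposes as pooler_output.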