The objective is to add token embeddings (custom ones, obtained using a different model) and positional embeddings.
Is there a way I can obtain the positional embeddings along with the token embeddings for an article (500-1000 words) using the BART model?
tokenized_sequence = tokenizer(sentence, padding='max_length', truncation=True, max_length=512, return_tensors="pt")
The output contains input_ids and attention_mask, but there is no parameter to return position_ids, as there is in the BERT model:
bert.embeddings.position_embeddings('YOUR_POSITIONS_IDS')
Or is using sinusoidal positional encoding the only way to obtain the positional embeddings?
The tokenizer is not responsible for the embeddings. It only generates the ids to be fed into the embedding layer. BART's embeddings are learned, i.e. the embeddings come from its own embedding layers.
You can retrieve both types of embeddings directly from the model. Here bart is a BartModel. The encoding is (roughly) done like this:
embed_pos = bart.encoder.embed_positions(input_ids)
inputs_embeds = bart.encoder.embed_tokens(input_ids)
hidden_states = inputs_embeds + embed_pos
Full working code:
from transformers import BartForConditionalGeneration, BartTokenizer

bart = BartForConditionalGeneration.from_pretrained("facebook/bart-base", forced_bos_token_id=0)
tok = BartTokenizer.from_pretrained("facebook/bart-base")

example_english_phrase = "UN Chief Says There Is No <mask> in Syria"
input_ids = tok(example_english_phrase, return_tensors="pt").input_ids

# learned positional embeddings (depend only on the positions, not on the token ids)
embed_pos = bart.model.encoder.embed_positions(input_ids)
# token embeddings, scaled by embed_scale as in the encoder (by default the scale is 1.0)
inputs_embeds = bart.model.encoder.embed_tokens(input_ids) * bart.model.encoder.embed_scale
hidden_states = inputs_embeds + embed_pos
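The same calls work for a full article like the one in the question; just keep the tokenized length within BART's learned position range (facebook/bart-base has 1024 learned positions). A minimal sketch, reusing bart and tok from above and assuming article is a hypothetical variable holding your 500-1000 word text:
article = " ".join(["word"] * 800)  # placeholder for the real article text
enc = tok(article, truncation=True, max_length=1024, return_tensors="pt")
input_ids = enc.input_ids

embed_pos = bart.model.encoder.embed_positions(input_ids)  # one positional embedding per token position
inputs_embeds = bart.model.encoder.embed_tokens(input_ids) * bart.model.encoder.embed_scale
hidden_states = inputs_embeds + embed_pos  # token + positional embeddings, shape (1, seq_len, hidden_size)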
Note that embed_pos
is invariant to the actual token ids; only their positions matter. If the input grows longer, embeddings for the "new" positions are simply looked up in addition, without changing the embeddings of the earlier positions:
These cases yield the same embeddings:
embed_positions([0, 1]) == embed_positions([123, 241]) == embed_positions([444, 3453, 9344, 3453])[:2]
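A runnable version of that check, reusing bart from above (the ellipsis slice takes the first two entries along the sequence dimension, so it works whether or not embed_positions returns a batch dimension in your transformers version):
import torch

pos = bart.model.encoder.embed_positions
a = pos(torch.tensor([[0, 1]]))
b = pos(torch.tensor([[123, 241]]))
c = pos(torch.tensor([[444, 3453, 9344, 3453]]))

print(torch.equal(a, b))              # True: same length, so the same positions are looked up
print(torch.equal(a, c[..., :2, :]))  # True: the first two positions keep their embeddings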