pytorch · nlp · huggingface-transformers · summarization · bart

How to find positional embeddings from BartTokenizer?


The objective is to add token embeddings (customized, obtained using a different model) and the positional embeddings.

Is there a way I can find the positional embeddings along with the token embeddings for an article (500-1000 words long) using the BART model?

tokenized_sequence = tokenizer(sentence, padding='max_length', truncation=True, max_length=512, return_tensors="pt")

The output contains input_ids and attention_mask, but there is no parameter to return position_ids like in the BERT model, where you can do:

bert.embeddings.position_embeddings('YOUR_POSITIONS_IDS')
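
For comparison, in BERT the lookup can be done roughly like this (a minimal sketch, assuming bert is a BertModel and the position ids are simply 0..seq_len-1):

import torch
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")
position_ids = torch.arange(10).unsqueeze(0)                  # shape (1, 10)
pos_emb = bert.embeddings.position_embeddings(position_ids)   # shape (1, 10, 768)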

Or is the only way to obtain positional embeddings to use a sinusoidal positional encoding?


Solution

  • The tokenizer is not responsible for the embeddings. It only generates the ids that are fed into the embedding layer. BART's embeddings are learned, i.e. the embeddings come from its own embedding layer.

    You can retrieve both types of embeddings as follows. Here bart is a BartModel. The encoding is (roughly) done like this:

    embed_pos = bart.encoder.embed_positions(input_ids)
    inputs_embeds = bart.encoder.embed_tokens(input_ids)
    hidden_states = inputs_embeds + embed_pos
    

    Full working code:

    from transformers import BartForConditionalGeneration, BartTokenizer
    
    bart = BartForConditionalGeneration.from_pretrained("facebook/bart-base", forced_bos_token_id=0)
    tok = BartTokenizer.from_pretrained("facebook/bart-base")
    example_english_phrase = "UN Chief Says There Is No <mask> in Syria"
    input_ids = tok(example_english_phrase, return_tensors="pt").input_ids
    
    embed_pos = bart.model.encoder.embed_positions(input_ids)
    inputs_embeds = bart.model.encoder.embed_tokens(input_ids) * bart.model.encoder.embed_scale  # by default the scale is 1.0 and is applied to the token embeddings
    hidden_states = inputs_embeds + embed_pos
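
    If the goal from the question is to combine custom token embeddings (obtained from a different model) with BART's learned positional embeddings, the same addition works as long as the custom embeddings have the shape (batch, seq_len, d_model). A minimal sketch, where custom_token_embeds is a hypothetical stand-in tensor:

    import torch

    # hypothetical tensor standing in for token embeddings from another model,
    # already projected to (batch, seq_len, d_model); d_model is 768 for bart-base
    custom_token_embeds = torch.randn_like(inputs_embeds)
    combined = custom_token_embeds + embed_pos   # same recipe as hidden_states above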
    

    Note that embed_pos is invariant to the actual token ids; only their positions matter. If the input grows longer, "new" embeddings are appended without changing the embeddings of the earlier positions:

    These cases yield the same embeddings: embed_positions([0, 1]) == embed_positions([123, 241]) == embed_positions([444, 3453, 9344, 3453])[:2]
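
    A quick way to check this invariance (a sketch that reuses bart from above and assumes a transformers version whose embed_positions accepts the input ids directly, as in the code above):

    import torch

    a = bart.model.encoder.embed_positions(torch.tensor([[0, 1]]))
    b = bart.model.encoder.embed_positions(torch.tensor([[123, 241]]))
    c = bart.model.encoder.embed_positions(torch.tensor([[444, 3453, 9344, 3453]]))
    print(torch.equal(a, b))         # True: same positions, different token ids
    print(torch.equal(a, c[:, :2]))  # True: earlier positions are unchanged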