I would like to visualize the attention layers of a Phi-3-medium-4k-instruct (or mini) model downloaded from Hugging Face. In particular, I am using the following model and tokenizer:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import pdb
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-medium-4k-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-medium-4k-instruct",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True
)
# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=50,
    do_sample=False
)
prompt = "..."
# tokenize the input prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
input_ids = input_ids.to("cuda:0")
# get the output of the model
model_output = model.model(input_ids)
# extract the attention layer
attention = model_output[-1]
Firstly, I am wondering whether that is the correct way to extract attention from my model. What should I expect from this output, and how can I visualize it properly? Shouldn't I expect a matrix of size n_tokens x n_tokens?
The attention variable I have extracted has a size of 1x40x40x15x15 (or 1x12x12x15x15 in the case of the mini model), where (after the leading batch dimension) the first dimension corresponds to the different layers, the second to the different heads, and the final two to the attention matrix. That is actually my assumption and I am not sure whether it is correct. When I visualize the attention I get some very weird matrices like:
What we see in this figure, I assume, is all the heads for one layer. However, most of the heads distribute the attention almost equally across all the tokens. Does that make sense?
Edit: For the visualization I am doing something like:
# Save attention visualization code
import matplotlib.pyplot as plt

def save_attention_image(attention, tokens, filename='attention.png'):
    """
    Save the attention weights of all heads for one layer as an image.
    :param attention: The attention weights for one layer (batch dimension first).
    :param tokens: The tokens corresponding to the input.
    :param filename: The filename to save the image.
    """
    attn = attention[0].detach().cpu().float().numpy()  # drop the batch dimension -> (num_heads, seq_len, seq_len)
    num_heads = attn.shape[0]
    fig, axes = plt.subplots(3, 4, figsize=(20, 15))  # adjust the grid size to the number of heads
    for i, ax in enumerate(axes.flat):
        if i < num_heads:
            cax = ax.matshow(attn[i], cmap='viridis')
            ax.set_title(f'Head {i + 1}')
            ax.set_xticks(range(len(tokens)))
            ax.set_yticks(range(len(tokens)))
            ax.set_xticklabels(tokens, rotation=90)
            ax.set_yticklabels(tokens)
        else:
            ax.axis('off')
    fig.colorbar(cax, ax=axes.ravel().tolist())
    plt.suptitle('Layer 1')  # layer label is hard-coded here
    plt.savefig(filename)
    plt.close()
Here is what you need to know (running Colab code: https://colab.research.google.com/drive/13gP71u_u_Ewx8u7aTwgzSlH0N_k9XBXx?usp=sharing).
You want to see the attention weights from your Phi-3 model. First thing: you must tell the model to output attentions. Usually you do
outputs = model(input_ids, output_attentions=True)
Then outputs.attentions will be a tuple with one element per layer. Each element is a tensor of shape (batch, num_heads, seq_len, seq_len) – that is what you expect, an n_tokens x n_tokens matrix per head.
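For example, a quick structure check (a minimal sketch, assuming outputs comes from the call above):
# one entry per layer; each entry is (batch, num_heads, seq_len, seq_len)
print(len(outputs.attentions))
print(outputs.attentions[0].shape)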
What you did, using
model_output = model.model(input_ids)
attention = model_output[-1]
may or may not be correct – it depends on how model.forward is coded. It is better to use the output_attentions flag so you get the proper attention weights.
About the shape you see, e.g. 1x40x40x15x15 (or 1x12x12x15x15) – this likely means the per-layer attentions were stacked into a single tensor, so the dimensions are roughly (batch, layers, heads, seq_len, seq_len) instead of one (batch, heads, seq_len, seq_len) tensor per layer.
If many heads show nearly uniform attention, that can be normal – sometimes heads do that, not focusing on any token in particular.
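If you want to check that numerically instead of by eye, here is a minimal sketch (the helper name attention_entropy is mine, not a transformers API). It computes the entropy of each head's attention rows; rows that spread attention evenly come out with high entropy:
import torch

def attention_entropy(layer_attn):
    # layer_attn: one element of outputs.attentions, shape (batch, num_heads, seq_len, seq_len)
    attn = layer_attn[0].float().clamp_min(1e-12)   # drop batch dim, avoid log(0)
    row_entropy = -(attn * attn.log()).sum(dim=-1)  # entropy per query position, per head
    return row_entropy.mean(dim=-1)                 # mean entropy per head, shape (num_heads,)

# heads with entropy close to log(seq_len) spread attention almost uniformly
# print(attention_entropy(outputs.attentions[0]))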
For a proper visualization, select one layer and one head, like:
attn = outputs.attentions[layer][0, head] # shape (seq_len, seq_len)
and then use your plotting code (imshow or matshow) to visualize it.
So, in summary: use model(..., output_attentions=True) to get the correct attention; then each attention tensor will be (batch, heads, seq_len, seq_len) – that is the matrix you expect. If you see extra dimensions, check whether you are calling the right forward method. And yes, many heads may show a near-uniform distribution – that can be normal in transformer models.
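If you keep your current 5-D tensor instead, a hedged guess at slicing it (this assumes it really is a single tensor laid out as (batch, layers, heads, seq_len, seq_len), as discussed above):
layer, head = 0, 0
# one (seq_len, seq_len) matrix for that layer/head, if the layout assumption holds
attn = attention[0, layer, head].detach().cpu().float().numpy()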
Hope this helps; you can put my code in your Colab as is.
Note that when using Hugging Face Transformers, the recommended approach is to run:
outputs = model(
    input_ids=inputs,
    output_attentions=True,
    # possibly also output_hidden_states=True if you want hidden states
)
Then outputs.attentions will be a tuple with one entry per layer, each entry shaped (batch_size, num_heads, seq_len, seq_len).
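If you prefer a single tensor over a tuple, a small sketch using plain torch.stack (nothing model-specific) also shows where an extra leading dimension can come from:
# stack the per-layer tuple into a single tensor of shape
# (num_layers, batch_size, num_heads, seq_len, seq_len)
all_attentions = torch.stack(outputs.attentions, dim=0)
print(all_attentions.shape)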
If you call model.model(input_ids) directly (as in your code snippet), you might be accessing a lower-level forward function that returns a different structure. Instead, call the top-level model with output_attentions=True. That yields attention shapes more in line with standard Hugging Face conventions.
OK, so basically you want to see attention. You pass output_attentions=True when calling the model, then read outputs.attentions. That has the standard shape (batch, heads, seq_len, seq_len) per layer. Then pick a layer and a head to plot. Some heads look uniform; that is normal. If you call model.model(input_ids) directly, it might not give the standard shape. Safer is:
# !pip install transformers torch
import torch
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load tokenizer and model (make sure you have a valid license for the model)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-medium-4k-instruct")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-medium-4k-instruct",  # note: check the spelling if you get an error
    device_map="auto",
    torch_dtype=torch.float16,  # or torch.float32 if preferred
    trust_remote_code=True
)
# Prepare a prompt
prompt = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(prompt, return_tensors="pt")
inputs = inputs.to("cuda:0") # send inputs to cuda
# Run the model with attention outputs enabled
# Make sure to pass output_attentions=True
outputs = model(input_ids=inputs.input_ids, output_attentions=True)
# outputs.attentions is a tuple with one element per layer
# Each element is a tensor of shape (batch_size, num_heads, seq_len, seq_len)
attentions = outputs.attentions
# For example, choose layer 0 and head 0 to visualize
layer = 0
head = 0
attn = attentions[layer][0, head].detach().cpu().float().numpy()  # shape (seq_len, seq_len); cast to float32 before plotting
# Get tokens for labeling the axes
tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids[0])
# Visualize the attention matrix using matplotlib
plt.figure(figsize=(8,8))
plt.imshow(attn, cmap="viridis")
plt.colorbar()
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.title(f"Attention Matrix (Layer {layer}, Head {head})")
plt.show()
Now you see a nice n_tokens by n_tokens matrix. If the model has 12 layers, you see 12 entries in outputs.attentions; if "medium" has 40 layers, you see 40. Each head's matrix is 15×15 if your input is 15 tokens. Some heads do near-uniform attention; that is normal. That is basically all.
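If you want the full grid of heads for one layer (similar to your save_attention_image above), here is a sketch along the same lines, reusing attentions, tokens, and plt from the snippet above (the 4-column grid layout is just a guess, adjust it to the head count):
import math

layer = 0
layer_attn = attentions[layer][0].detach().cpu().float().numpy()  # (num_heads, seq_len, seq_len)
num_heads = layer_attn.shape[0]
cols = 4
rows = math.ceil(num_heads / cols)
fig, axes = plt.subplots(rows, cols, figsize=(4 * cols, 4 * rows))
for i, ax in enumerate(axes.flat):
    if i < num_heads:
        ax.matshow(layer_attn[i], cmap="viridis")
        ax.set_title(f"Head {i}")
        ax.set_xticks([])  # drop per-cell tick labels to keep the grid readable
        ax.set_yticks([])
    else:
        ax.axis("off")
fig.suptitle(f"Layer {layer}")
plt.tight_layout()
plt.savefig(f"layer_{layer}_heads.png")
plt.close(fig)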
NOTE -
When you do something like:
model_output = model.model(input_ids)
attention = model_output[-1]
You’re relying on how the internal forward method organizes its return. Some models do return (hidden_states, present, attentions, ...) but some do not. It’s safer to rely on the official Hugging Face usage:
outputs = model(..., output_attentions=True)
attention = outputs.attentions
That’s guaranteed to be the standard shape.
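As a final sanity check, you can compare those shapes against the model config (this assumes the config exposes the usual num_hidden_layers and num_attention_heads fields, and reuses the inputs variable from the earlier snippet):
# one entry per layer, each (batch, num_heads, seq_len, seq_len)
num_layers = len(attention)
batch, num_heads, q_len, k_len = attention[0].shape
assert num_layers == model.config.num_hidden_layers
assert num_heads == model.config.num_attention_heads
assert q_len == k_len == inputs.input_ids.shape[1]
print(f"{num_layers} layers, {num_heads} heads, {q_len} tokens")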