import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, GPT2LMHeadModel
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    device_map="auto",
)
model2 = GPT2LMHeadModel(model.config).to("cuda")
model2.load_state_dict(model.state_dict())
tokenizer = AutoTokenizer.from_pretrained("gpt2")
t = tokenizer("hello_world", return_tensors="pt")["input_ids"].to("cuda")
a = model(t).logits
b = model2(t).logits
print(a - b)
print(a)
print(b)
model2 behaves very differently from model (its loss is much higher), even though the model structures and parameters are exactly the same. From the output, it looks as if something is randomized in model2. Could anyone tell me what is going on? I have the "accelerate" package installed.
The config and the parameters are the same. I also checked the forward functions, and there is no difference at all. However, after setting model2.transformer.forward = model.transformer.forward, the two models behave the same.
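A quick sanity check (illustrative snippet, using the model and model2 objects from the code above) confirms the weights really are identical, so the difference must come from runtime behavior rather than the parameters:

for (n1, p1), (n2, p2) in zip(model.named_parameters(), model2.named_parameters()):
    # torch.equal checks exact element-wise equality of the two tensors
    assert n1 == n2 and torch.equal(p1, p2), f"mismatch in {n1}"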
You need to set the models to eval mode with model.eval() to disable dropout if you want them to produce the same results:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, GPT2LMHeadModel
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    device_map="auto",
)
model2 = GPT2LMHeadModel(model.config).to("cuda")
model2.load_state_dict(model.state_dict())
# set to eval
model.eval()
model2.eval()
tokenizer = AutoTokenizer.from_pretrained("gpt2")
t = tokenizer("hello_world", return_tensors="pt")["input_ids"].to("cuda")
a = model(t).logits
b = model2(t).logits
assert (a == b).all()
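Note that from_pretrained() returns the model already in eval mode, while instantiating GPT2LMHeadModel(config) leaves it in training mode, which is why dropout was active only in model2 in the original snippet. A quick way to confirm (illustrative check, run right after constructing the two models and before the eval() calls):

print(model.training)   # False: from_pretrained() puts the model in eval mode
print(model2.training)  # True: a freshly constructed model starts in training mode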