machine-learning pytorch nlp huggingface-transformers gpt-2

How to change the fully connected network in a GPT model on Huggingface?

I'm following this tutorial on training a causal language model from scratch.

In the tutorial they load the standard GPT2 as follows:

from transformers import AutoTokenizer, GPT2LMHeadModel, AutoConfig

# `tokenizer` and `context_length` are defined earlier in the tutorial.
config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
model = GPT2LMHeadModel(config)

How can I load the same model, but use my own custom fully connected network instead of the standard one? I mainly want to experiment with variations such as more or fewer layers, different activation functions, etc.

I found the source code here, but it's very convoluted and I can't figure out how to replace the fully connected parts with custom ones, or what structure the custom one should have in the first place (e.g., input/output size).

Update: For example, using an FC network such as:

import torch
from torch import nn

class FC_model(nn.Module):
    def __init__(self):
        super().__init__()

        self.fc1 = nn.Linear(768,256)
        self.fc2 = nn.Linear(256,256)
        self.fc3 = nn.Linear(256,50000)

    def forward(self, x):
        x = torch.sin(self.fc1(x)) + torch.rand(1)
        x = torch.sin(self.fc2(x))
        x = self.fc3(x)
        return x

Solution

I'm assuming that by "the fully connected network" you mean the final Fully Connected (FC) / Linear layer, which is lm_head in the printout below.

from transformers import AutoTokenizer, GPT2LMHeadModel, AutoConfig, GPT2Config
configuration = GPT2Config()
model = GPT2LMHeadModel(configuration)
print(model)

The above would show you the modules inside the model:

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
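The printout does not show sizes for the Conv1D modules, which makes it hard to tell what input/output dimensions a replacement would need. As a rough sketch, the weight shapes can be read off the modules directly. The choice of n_layer=1 is mine, only to keep the example light, and is not part of the tutorial.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Randomly initialised model with default sizes; n_layer=1 is an
# illustrative choice to keep construction cheap.
config = GPT2Config(n_layer=1)
model = GPT2LMHeadModel(config)

mlp = model.transformer.h[0].mlp

# Conv1D stores its weight as (in_features, out_features),
# i.e. the transpose of the nn.Linear layout.
c_fc_shape = tuple(mlp.c_fc.weight.shape)      # hidden -> inner
c_proj_shape = tuple(mlp.c_proj.weight.shape)  # inner -> hidden

# lm_head is a regular nn.Linear, stored as (out_features, in_features).
lm_head_shape = tuple(model.lm_head.weight.shape)

print(c_fc_shape, c_proj_shape, lm_head_shape)
# (768, 3072) (3072, 768) (50257, 768)
```

So with the default configuration, the MLP inside each block takes and returns tensors whose last dimension is 768, while lm_head takes 768 features and returns one logit per vocabulary entry.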

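You can now access and replace the output layer (lm_head), for example:

from torch import nn

model.lm_head = nn.Sequential(
    nn.Linear(in_features=768, out_features=256),
    nn.ReLU(inplace=True),
    # Plain Dropout: Dropout1d would treat the sequence axis as channels and zero whole token vectors.
    nn.Dropout(0.25),
    # The final layer must output one logit per vocabulary entry, otherwise the loss cannot be computed.
    nn.Linear(in_features=256, out_features=configuration.vocab_size),
)

The above is just a sample; you can experiment with different depths, widths, and activation functions, as long as the input size stays 768 and the output size equals the vocabulary size. Note that the new head is no longer tied to the input embedding (wte), as it is in the default GPT-2.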
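If the goal is instead to change the feed-forward network inside every transformer block (the mlp entries in the printout), a similar swap seems to work there too. The following is a minimal sketch under a few assumptions: CustomMLP and its inner_size parameter are hypothetical names of my own, the small config values are chosen only to make the example fast, and it relies on each block calling its mlp with a single hidden-state tensor and adding the result back to the residual, so the module has to return the same shape it receives.

```python
import torch
from torch import nn
from transformers import GPT2Config, GPT2LMHeadModel


class CustomMLP(nn.Module):
    """Hypothetical stand-in for GPT2MLP: hidden -> inner -> inner -> hidden."""

    def __init__(self, hidden_size, inner_size, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, inner_size),
            nn.SiLU(),  # any activation can be tried here
            nn.Linear(inner_size, inner_size),
            nn.SiLU(),
            nn.Linear(inner_size, hidden_size),  # must map back to hidden_size
            nn.Dropout(dropout),
        )

    def forward(self, hidden_states):
        return self.net(hidden_states)


# Tiny illustrative config (not the tutorial's) so the example runs quickly.
config = GPT2Config(n_layer=2, n_embd=64, n_head=2, vocab_size=100)
model = GPT2LMHeadModel(config)

# Replace the feed-forward part of every block.
for block in model.transformer.h:
    block.mlp = CustomMLP(config.n_embd, inner_size=128)

# Quick forward pass to confirm the shapes still line up.
input_ids = torch.randint(0, config.vocab_size, (2, 10))
logits = model(input_ids).logits
print(logits.shape)  # torch.Size([2, 10, 100])
```

Since the model is trained from scratch here, the freshly initialised replacement layers are on the same footing as the rest of the network; with pretrained weights, the swapped modules would start untrained.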