
How to change the fully connected network in a GPT model on Huggingface?


I'm following this tutorial on training a causal language model from scratch.

In the tutorial they load the standard GPT2 as follows:

from transformers import AutoTokenizer, GPT2LMHeadModel, AutoConfig

config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
model = GPT2LMHeadModel(config)

How can I load the same model but use my own fully connected network instead of the standard one? I mainly want to experiment with variations such as more or fewer layers, different activation functions, etc.

I found the source code here, but it's very convoluted, and I can't figure out how to replace the fully connected parts with custom ones, or what structure the custom ones should have in the first place (e.g., input/output size).

Update: For example, using an FC network such as:

import torch
import torch.nn as nn

class FC_model(nn.Module):
    def __init__(self):
        super().__init__()

        self.fc1 = nn.Linear(768,256)
        self.fc2 = nn.Linear(256,256)
        self.fc3 = nn.Linear(256,50000)

    def forward(self, x):
        x = torch.sin(self.fc1(x)) + torch.rand(1)
        x = torch.sin(self.fc2(x))
        x = self.fc3(x)
        return x

Solution

  • I'm assuming that by "fully connected network" you mean the final fully connected (linear) layer of the model, i.e. lm_head.

    from transformers import GPT2Config, GPT2LMHeadModel

    configuration = GPT2Config()              # default GPT-2 hyperparameters
    model = GPT2LMHeadModel(configuration)    # randomly initialised, no download
    print(model)
    

    This prints the modules inside the model:

    GPT2LMHeadModel(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 768)
        (wpe): Embedding(1024, 768)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0-11): 12 x GPT2Block(
            (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (attn): GPT2Attention(
              (c_attn): Conv1D()
              (c_proj): Conv1D()
              (attn_dropout): Dropout(p=0.1, inplace=False)
              (resid_dropout): Dropout(p=0.1, inplace=False)
            )
            (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (mlp): GPT2MLP(
              (c_fc): Conv1D()
              (c_proj): Conv1D()
              (act): NewGELUActivation()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      )
      (lm_head): Linear(in_features=768, out_features=50257, bias=False)
    )
    

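    You can now access and replace this final layer (lm_head), for example:

    import torch.nn as nn

    model.lm_head = nn.Sequential(
        nn.Linear(in_features=768, out_features=256),
        nn.ReLU(inplace=True),
        # plain Dropout zeroes individual features; Dropout1d would zero whole channels
        nn.Dropout(0.25),
        # the last layer must produce one logit per vocabulary entry
        nn.Linear(in_features=256, out_features=model.config.vocab_size),
    )


    The above is just a sample; you can experiment with different combinations, as long as the input size stays 768 (the hidden size of the model) and the output size equals the vocabulary size, because the language-modelling loss is computed over the vocabulary.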
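
If "fully connected network" instead refers to the feed-forward part inside each transformer block (the mlp entries in the printout above), the same attribute-assignment approach should work there too. The following is a minimal sketch, not the library's recommended method: CustomMLP, its hidden width, and the tiny config values are hypothetical names and numbers chosen for illustration, and it assumes that each GPT2Block calls its mlp sub-module with a single hidden-state tensor and expects a tensor of the same shape back.

```python
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel


class CustomMLP(nn.Module):
    """Hypothetical replacement for GPT2MLP; it must map n_embd -> n_embd."""

    def __init__(self, n_embd: int, hidden: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, hidden),
            nn.SiLU(),                   # any activation can be tried here
            nn.Linear(hidden, hidden),   # extra layer compared with the default MLP
            nn.SiLU(),
            nn.Linear(hidden, n_embd),   # project back to the residual-stream width
            nn.Dropout(dropout),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.net(hidden_states)


# Deliberately tiny, illustrative config so the sketch runs quickly on CPU.
config = GPT2Config(vocab_size=1000, n_positions=64, n_embd=64, n_layer=2, n_head=4)
model = GPT2LMHeadModel(config)

# Swap the feed-forward sub-module in every transformer block.
for block in model.transformer.h:
    block.mlp = CustomMLP(config.n_embd, hidden=128)

# Sanity check: a forward pass with labels should still yield logits and a loss.
input_ids = torch.randint(0, config.vocab_size, (2, 16))
outputs = model(input_ids=input_ids, labels=input_ids)
print(outputs.logits.shape)  # (batch, sequence, vocab_size)
```

The only structural constraint in this sketch is that the custom module keeps the input and output width equal to n_embd (768 for the default GPT-2 config), since its output is added back to the residual stream; the number of layers, the hidden width, and the activation in between are free to vary.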