I am new to NLP and I was trying to train GPT-2 on my own data.
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer, TextDataset, DataCollatorForLanguageModeling, Trainer, TrainingArguments
config = GPT2Config(vocab_size=10000, n_positions=256, n_ctx=256, n_embd=512, n_layer=12, n_head=8)
model = GPT2LMHeadModel(config=config)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
train_data = TextDataset(tokenizer=tokenizer, file_path='train.txt', block_size=256)
training_args = TrainingArguments(
    output_dir='./models',
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    save_steps=1000,
    save_total_limit=2,
    prediction_loss_only=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
    prediction_loss_only=True,
)
trainer.train()
This is my code, and I have checked that the training data is being loaded correctly and converted to token IDs.
My train data looks like:
""" Hello How are you doing today? whats up MD im doing good how are you doing? Im alright, I just took a nap. But it was one of those naps that doesnt help anything. It just makes everything worse and you question all your life choices oh wow haha so you still feel tired huh? Yeah did you go to bed late? """
When I call trainer.train(), I get this error: IndexError: index out of range in self, at this line: torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse). The problem is that I can't inspect the values of weight, input, etc., because it is an internal function.
Please help.
I tried changing parameters such as per_device_train_batch_size, but I am still stuck.
The error you are experiencing is most likely due to the size of the vocabulary you have set in your GPT2Config.
You have set vocab_size to 10000, but the pretrained GPT-2 tokenizer you load with GPT2Tokenizer.from_pretrained('gpt2') has a vocabulary of 50257 tokens, so it produces token IDs anywhere between 0 and 50256. Your model's embedding table only has 10000 rows, so any token ID of 10000 or above falls outside it, and that out-of-range lookup is exactly what fails inside torch.embedding.
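You can confirm this before training. A minimal check (using one sentence from your sample data) that compares the IDs the tokenizer produces against the vocab_size you configured:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
print(len(tokenizer))  # 50257 entries in the pretrained GPT-2 vocabulary

ids = tokenizer.encode("Im alright, I just took a nap.")
print(max(ids))  # typically well above 9999
# Any ID >= your configured vocab_size (10000) has no row in the embedding
# matrix, so the lookup inside torch.embedding raises
# "IndexError: index out of range in self".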
To fix this, set vocab_size in your GPT2Config to 50257 so the embedding matrix covers every ID the tokenizer can produce. More generally, the model's vocab_size has to match the tokenizer that prepared your training data; if they differ, some token IDs simply have no corresponding embedding row.
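Alternatively, if you would rather keep the small 10000-token vocabulary, the tokenizer itself has to be trained to that size on your data. One way to do that is with the separate tokenizers package; this is only a rough sketch, and the paths and settings here are illustrative rather than taken from your setup:

import os
from tokenizers import ByteLevelBPETokenizer

# Train a 10000-entry byte-level BPE vocabulary on your own file so that
# every token ID stays in the 0..9999 range your small config expects.
bpe = ByteLevelBPETokenizer()
bpe.train(files=['train.txt'], vocab_size=10000, min_frequency=2)
os.makedirs('./my_tokenizer', exist_ok=True)
bpe.save_model('./my_tokenizer')  # writes vocab.json and merges.txt

# Load it back as a GPT-2 style tokenizer and use it below instead of the
# pretrained 'gpt2' one.
from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained('./my_tokenizer')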
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer, TextDataset, DataCollatorForLanguageModeling, Trainer, TrainingArguments
config = GPT2Config(vocab_size=50257, n_positions=256, n_ctx=256, n_embd=512, n_layer=12, n_head=8)  # vocab_size now matches the pretrained GPT-2 tokenizer
model = GPT2LMHeadModel(config=config)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
train_text = """Hello How are you doing today? whats up MD im doing good how are you doing? Im alright, I just took a nap. But it was one of those naps that doesnt help anything. It just makes everything worse and you question all your life choices oh wow haha so you still feel tired huh? Yeah did you go to bed late?"""
train_data = TextDataset(tokenizer=tokenizer, file_path=None, split_text=train_text, block_size=256)
training_args = TrainingArguments(
    output_dir='./models',
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=4,
    save_steps=1000,
    save_total_limit=2,
    prediction_loss_only=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),  # mlm=False gives causal language-modeling labels
    prediction_loss_only=True,
)
trainer.train()
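Once training finishes, it is also worth saving the tokenizer next to the trained weights so that a later from_pretrained call restores a matching pair; the path below is just the output_dir used above:

# Save the trained weights and the tokenizer they were trained with side by side.
trainer.save_model('./models')
tokenizer.save_pretrained('./models')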