I'm fine-tuning a transformer seq2seq model (GODEL base), but I can't seem to get the history saved into the tokenizer correctly. Here's the code:
from transformers import AutoTokenizer

# Pull the columns out of the dataframe
context = list(df['Context'])
knowledge = list(df['Knowledge'])
response = list(df['Response'])

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/GODEL-v1_1-base-seq2seq", padding_side='left', truncation_side='left')

for i in range(len(context)):
    # Prepare the history: accumulate context, knowledge and response
    # for every turn up to and including the current one
    history = ""
    for j in range(i + 1):
        history += f"{context[j]} {knowledge[j]} {response[j]}"
    # Tokenize the input sequences
    inputs = tokenizer(history, context[i], knowledge[i], padding="longest", max_length=512, truncation=True, return_tensors="pt")
    # Encode the response sequences
    outputs = tokenizer(history, response[i], padding="longest", max_length=512, truncation=True, return_tensors="pt")
The outputs tokenizer call should store the context of the current index along with the context + knowledge + response of all previous indexes, which make up the history.
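As a sanity check, the encoded ids can be decoded back to see what the tokenizer actually kept after truncation (a minimal sketch using the inputs from the last loop iteration):

# Decode the single encoded sequence to inspect what survived truncation
print(tokenizer.decode(inputs["input_ids"][0], skip_special_tokens=True))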
The problem was that I was iterating over a pandas Series while treating it as a list. To resolve this, call the .tolist() function on the pandas Series before iterating over it.
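A minimal sketch of that fix, keeping everything else the same:

# Convert each pandas Series to a plain Python list up front
context = df['Context'].tolist()
knowledge = df['Knowledge'].tolist()
response = df['Response'].tolist()

With these lists, each element is a plain str, so the f-string in the history loop and the positional indexing behave as expected, and the rest of the loop runs unchanged.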