The guide for fine-tuning Gemma with the Hugging Face toolset is at: https://huggingface.co/blog/gemma-peft
Link to the line: https://huggingface.co/blog/gemma-peft#:~:text=Quote%3A%20%7Bexample-,%5B%27quote%27%5D%5B0%5D,-%7D%5CnAuthor%3A
The data entry formatting function is:

    def formatting_func(example):
        text = f"Quote: {example['quote'][0]}\nAuthor: {example['author'][0]}<eos>"
        return [text]
Do those [0] indexes make sense? They look wrong, because when printing out the text variable I see single characters instead of full strings.
Yes, they make sense: the example variable is a slice of the dataset (a batch), not a single item from the dataset. In your particular case you have the batch size set to 1:

    per_device_train_batch_size=1
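For reference, this flag comes from transformers.TrainingArguments, as used in the linked guide; a minimal sketch with placeholder values:

    from transformers import TrainingArguments

    # Minimal sketch: "outputs" is a placeholder directory; every other
    # argument is left at its default.
    args = TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=1,  # each batch contains exactly one row
    )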
That setting means the entire dataset is split into batches of size 1, i.e. each column in a batch is a list of length 1. So one batch has the following representation:

    {"quote": ["Quote 1"], "author": ["Author 1"]}
So in order to get the values, you use 0 as the index.
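Here is a small standalone sketch of that difference, using the placeholder values from above; it also shows why printing text from a single, unbatched item yields individual characters:

    # Batched: column values are lists, so [0] yields the whole first string.
    batched = {"quote": ["Quote 1"], "author": ["Author 1"]}
    print(batched["quote"][0])  # Quote 1

    # Unbatched: column values are plain strings, so [0] yields one character.
    single = {"quote": "Quote 1", "author": "Author 1"}
    print(single["quote"][0])   # Q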
If you use a batch size greater than or equal to 2, the representation would look like this:

    {"quote": ["Quote 1", "Quote 2", ...], "author": ["Author 1", "Author 2", ...]}
In that case the original formatting_func would build a text from only the first element of each batch and silently drop the rest. So, in line with the documentation, you would want a different formatting_func that iterates over both arrays, like this one:
    def formatting_prompts_func(example):
        # Build one formatted string per row of the batch.
        output_texts = []
        for i in range(len(example['instruction'])):
            text = f"### Question: {example['instruction'][i]}\n ### Answer: {example['output'][i]}"
            output_texts.append(text)
        return output_texts
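Applying the same pattern to the quote/author dataset from the question gives a version that works for any batch size. This is a sketch rather than code from the guide, with the <eos> literal kept from the original snippet:

    def formatting_func(example):
        # One formatted text per row of the batch.
        output_texts = []
        for i in range(len(example['quote'])):
            text = f"Quote: {example['quote'][i]}\nAuthor: {example['author'][i]}<eos>"
            output_texts.append(text)
        return output_texts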