machine-learning, huggingface-transformers, dataformat, mlmodel, gemma

Do those `[0]` indexes make sense when building the `text` variable?


The guide for fine-tuning Gemma with HuggingFace toolset is at: https://huggingface.co/blog/gemma-peft

Link to the line: https://huggingface.co/blog/gemma-peft#:~:text=Quote%3A%20%7Bexample-,%5B%27quote%27%5D%5B0%5D,-%7D%5CnAuthor%3A

The data entry formatting function is:

def formatting_func(example):
    text = f"Quote: {example['quote'][0]}\nAuthor: {example['author'][0]}<eos>"
    return [text]

Do those [0] make sense? They look wrong because, when I print out the text variable, I can see they are just single characters instead of full strings.
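
For illustration (arbitrary string, not from the dataset), indexing a plain string with [0] returns a single character, while indexing a list returns the whole string:

quote = "An arbitrary quote."
print(quote[0])    # 'A' -- a single character, which is what I am seeing
print([quote][0])  # 'An arbitrary quote.' -- the whole string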


Solution

  • Yes. The variable example is a slice of the dataset (a batch), not a single item from it. In your particular case, the batch size is set to 1:

    per_device_train_batch_size=1

    This means the entire dataset is split into batches of size 1, i.e. each column becomes an array of size 1. So one batch has the following representation:

    [{"quote": ["Quote 1"], "author": ["Author 1"]}]
    

    So, in order to get the values, you use 0 as the index.
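
    As a minimal sketch (with made-up quote and author values), this is what the call looks like for a batch of size 1:

    # Hypothetical batch of size 1: a dict whose values are single-element lists
    example = {"quote": ["Quote 1"], "author": ["Author 1"]}

    def formatting_func(example):
        text = f"Quote: {example['quote'][0]}\nAuthor: {example['author'][0]}<eos>"
        return [text]

    print(formatting_func(example))
    # ['Quote: Quote 1\nAuthor: Author 1<eos>']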

    Normally, if you use a batch size larger than or equal to 2, the representation would look like:

    [{"quote": ["Quote 1", "Quote 2", ...], "author": ["Author 1", "Author 2", ...]}]
    

    So, in line with the documentation, you would want to use a different formatting_func that iterates over both arrays, for example:

    def formatting_prompts_func(example):
        output_texts = []
        # Each column is a list with one entry per example in the batch
        for i in range(len(example['instruction'])):
            text = f"### Question: {example['instruction'][i]}\n ### Answer: {example['output'][i]}"
            output_texts.append(text)
        return output_texts
    

    Source: https://huggingface.co/docs/trl/en/sft_trainer
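
    As a quick sanity check (with made-up instruction and output values, not from any real dataset), here is how formatting_prompts_func behaves on a batch of size 2:

    # Hypothetical batch of size 2: each column holds one value per example
    batch = {
        "instruction": ["What is 2 + 2?", "Name a primary color."],
        "output": ["4", "Red"],
    }

    print(formatting_prompts_func(batch))
    # ['### Question: What is 2 + 2?\n ### Answer: 4',
    #  '### Question: Name a primary color.\n ### Answer: Red']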