Tags: python, dataset, huggingface-transformers, huggingface-datasets, fine-tuning

How to create a dataset with Huggingface from a list of strings to fine-tune Llama 2 with the transformers library?


I have a list of strings and I want to use them to fine-tune Llama 2. Each entry in the list contains a couple of sentences.

I need to bring this into the right format for the Trainer of the transformers library, but I can't seem to find anything online. This should be a really basic problem, right?

I don't need a validation dataset, just a way to feed the dataset into the trainer via

trainer = transformers.Trainer(model=model, train_dataset=dataset, ...)

This is what I have tried:

from datasets import Dataset

dataset = Dataset.from_list(list)

Solution

  • This is what worked for me in the end:

    import pandas as pd
    # `list` is the question's list of strings; note it shadows the builtin.
    df = pd.DataFrame(list)
    
    from datasets import Dataset
    # pandas names the single column 0 by default; rename it to "train"
    # so it can be addressed by name after conversion.
    dataset = Dataset.from_pandas(df.rename(columns={0: "train"}), split="train")
    

    and then to tokenize the data:

    tokenized_dataset = dataset.map(lambda samples: tokenizer(samples["train"]), batched=True)