I have a list of strings that I want to use to fine-tune Llama 2. Each entry in the list contains a couple of sentences.
I need to bring this into the right format for the Trainer of the transformers library, but I can't find anything about it online. This seems like it should be a really basic problem.
I don't need a validation dataset, just a way to feed the dataset into the trainer via
trainer = transformers.Trainer(model=model, train_dataset=dataset, ...)
This is what I have tried:
from datasets import Dataset
dataset = Dataset.from_list(list)
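As far as I can tell from the docs, Dataset.from_list expects a list of dicts (one dict per row), not a plain list of strings, which would explain why this fails. A minimal pure-Python sketch of the reshaping that would be needed first (the column name "text" is just my arbitrary choice):

```python
# Dataset.from_list wants a list of dicts, one per row, so a plain
# list of strings has to be wrapped first.
texts = ["First example sentence.", "Second example, with more words."]
rows = [{"text": t} for t in texts]
# rows is now a list of single-key dicts, which is the shape that
# Dataset.from_list(rows) accepts.
```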
This is what worked for me in the end:
import pandas as pd
from datasets import Dataset

df = pd.DataFrame(list)
dataset = Dataset.from_pandas(df.rename(columns={0: "train"}), split="train")
and then to tokenize the data:
tokenized_dataset = dataset.map(lambda samples: tokenizer(samples["train"]), batched=True)
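For reference, with batched=True the function passed to map receives a dict mapping column names to lists of values, and the dict it returns is merged back in as new columns. A pure-Python sketch of that contract (fake_tokenizer here is only a stand-in for the real tokenizer, to show the shapes involved):

```python
# Stand-in for the real tokenizer: just splits on whitespace. It
# takes a list of strings and returns a dict of new column lists,
# which is the shape map(..., batched=True) expects back.
def fake_tokenizer(texts):
    return {"input_ids": [t.split() for t in texts]}

# With batched=True, map passes batches shaped like this dict:
batch = {"train": ["hello world", "fine tune llama"]}
new_columns = fake_tokenizer(batch["train"])
# map() merges the returned columns into the dataset, so each row
# gains an "input_ids" field alongside the original "train" field.
```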