I have a list of strings that I want to use to fine-tune Llama 2. Each entry in the list contains a couple of sentences.
I need to bring this into the right format for the Trainer of the transformers library, but I can't find anything about it online. This seems like it should be a really basic problem.
I don't need a validation dataset, just a way to feed the dataset into the trainer via
trainer = transformers.Trainer(model=model, train_dataset=dataset, ...)
This is what I have tried:
from datasets import Dataset
dataset = Dataset.from_list(list)
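As far as I can tell from the docs, Dataset.from_list expects a list of dicts (one dict per row), not a plain list of strings, which would explain why this fails. A minimal pure-Python sketch of the reshaping that would be needed first (the column name "text" is just my arbitrary choice):

```python
# Dataset.from_list wants a list of dicts, one per row, so a plain
# list of strings has to be wrapped first.
texts = ["First example sentence.", "Second example, with more words."]
rows = [{"text": t} for t in texts]
# rows is now a list of single-key dicts, which is the shape that
# Dataset.from_list(rows) accepts.
```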
This is what worked for me in the end:
import pandas as pd
from datasets import Dataset

df = pd.DataFrame(list)
dataset = Dataset.from_pandas(df.rename(columns={0: "train"}), split="train")
and then to tokenize the data:
tokenized_dataset = dataset.map(lambda samples: tokenizer(samples["train"]), batched=True)
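For reference, with batched=True the function passed to map receives a dict mapping column names to lists of values, and the dict it returns is merged back in as new columns. A pure-Python sketch of that contract (fake_tokenizer here is only a stand-in for the real tokenizer, to show the shapes involved):

```python
# Stand-in for the real tokenizer: just splits on whitespace. It
# takes a list of strings and returns a dict of new column lists,
# which is the shape map(..., batched=True) expects back.
def fake_tokenizer(texts):
    return {"input_ids": [t.split() for t in texts]}

# With batched=True, map passes batches shaped like this dict:
batch = {"train": ["hello world", "fine tune llama"]}
new_columns = fake_tokenizer(batch["train"])
# map() merges the returned columns into the dataset, so each row
# gains an "input_ids" field alongside the original "train" field.
```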