I would like to train a zero-shot classifier on an annotated sample dataset.
I am following some tutorials, but since they all use their own data and the same pretrained model, I would like to confirm: is this the best approach?
Data example:
import pandas as pd
from datasets import Dataset
# Sample feedback data, it will have 8 samples per label
feedback_dict = [
{'text': 'The product is great and works well.', 'label': 'Product Performance'},
{'text': 'I love the design of the product.', 'label': 'Product Design'},
{'text': 'The product is difficult to use.', 'label': 'Usability'},
{'text': 'The customer service was very helpful.', 'label': 'Customer Service'},
{'text': 'The product was delivered on time.', 'label': 'Delivery Time'}
]
# Create a DataFrame with the feedback data
df = pd.DataFrame(feedback_dict)
# convert to Dataset format
df = Dataset.from_pandas(df)
Given the data format above, this is my approach to fine-tuning the model:
from setfit import SetFitModel, SetFitTrainer
# Select a model
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
# training with Setfit
trainer = SetFitTrainer(
model=model,
train_dataset=df, # to keep the code simple I do not create the df_train
eval_dataset=df, # to keep the code simple I do not create the df_eval
column_mapping={"text": "text", "label": "label"}
)
trainer.train()
The issue is that the process never finishes, even after more than 500 hours on a laptop, although the dataset has only about 88 records with 11 labels.
I ran the example you posted on Google Colab, and the training took 37 seconds.
Here is your code with a few tweaks to make it work on Colab:
### Install libraries
%%capture
!pip install datasets setfit
After installing the libraries, run the following code:
### Import dataset
import pandas as pd
from datasets import Dataset
# Sample feedback data, it will have 8 samples per label
feedback_dict = [
{'text': 'The product is great and works well.', 'label': 'Product Performance'},
{'text': 'I love the design of the product.', 'label': 'Product Design'},
{'text': 'The product is difficult to use.', 'label': 'Usability'},
{'text': 'The customer service was very helpful.', 'label': 'Customer Service'},
{'text': 'The product was delivered on time.', 'label': 'Delivery Time'}
]
# Create a DataFrame with the feedback data
df = pd.DataFrame(feedback_dict)
# convert to Dataset format
df = Dataset.from_pandas(df)
### Run training
from setfit import SetFitModel, SetFitTrainer
# Select a model
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")
# training with Setfit
trainer = SetFitTrainer(
model=model,
train_dataset=df, # to keep the code simple I do not create the df_train
eval_dataset=df, # to keep the code simple I do not create the df_eval
column_mapping={"text": "text", "label": "label"}
)
trainer.train()
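Before starting a long run, it can help to confirm which device the training would use. The check below is a minimal sketch, assuming PyTorch is the backend (SetFit builds on sentence-transformers, which uses it); the `device` variable is only illustrative and is not passed to the trainer. If it prints `cpu`, a slow run on a laptop would not be surprising.

```python
# Assumption: PyTorch is installed alongside setfit; fall back to "cpu" if it is not.
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    device = "cpu"

print(f"Training would run on: {device}")
```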
Finally, you can save the trained model to Google Drive and then download it to your PC manually.
### Download model to drive
from google.colab import drive
drive.mount('/content/drive')
# Use the public save_pretrained method rather than the private _save_pretrained
trainer.model.save_pretrained('/content/drive/path/to/target/folder')
If your main issue is the training time, this should fix it.
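To get a feel for how small this job is, here is a rough estimate of the training workload. It is only a sketch under stated assumptions: that `SetFitTrainer` uses its usual defaults (`num_iterations=20`, `batch_size=16`, `num_epochs=1`) and that each sample yields one positive and one negative pair per iteration. The helper `estimate_setfit_steps` is hypothetical and not part of the setfit library.

```python
import math

def estimate_setfit_steps(n_samples, num_iterations=20, batch_size=16, num_epochs=1):
    """Rough estimate of contrastive pairs and optimization steps (hypothetical helper)."""
    # Assumption: one positive and one negative pair per sample per iteration.
    n_pairs = n_samples * num_iterations * 2
    # Each epoch processes all pairs in batches of batch_size.
    steps_per_epoch = math.ceil(n_pairs / batch_size)
    return n_pairs, steps_per_epoch * num_epochs

pairs, steps = estimate_setfit_steps(88)  # 88 records, as in the question
print(pairs, steps)  # 3520 220
```

A few hundred steps on a base-sized model is a short job on a GPU, which is consistent with the 37 seconds seen on Colab. If it still runs too slowly on a CPU, lowering `num_iterations` should reduce the number of pairs proportionally.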