Tags: python, keras, scikit-learn, huggingface-transformers, smote

SMOTE with multiple BERT inputs


I'm building a multiclass text classification model using Keras and BERT (HuggingFace), but I have a very imbalanced dataset. I've used SMOTE from imbalanced-learn to generate additional samples for the underrepresented classes (I have 45 classes in total), which works fine when I use the input IDs from the BERT tokenizer.

However, I would also like to apply SMOTE to the attention mask IDs, so the model can determine where the padded values are.

My question is: how can I apply SMOTE to both the input IDs and the mask IDs? I've done the following so far, and the model doesn't complain, but I'm not sure whether the resampled masks match the resampled input IDs row for row. SMOTE takes two inputs, features and labels, so I've run the process twice with the same random state and returned only the elements I need:

from imblearn.over_sampling import SMOTE

def smote(input_ids, input_masks, labels):

    # Resample both arrays with the same random state so the random draws match
    sm = SMOTE(sampling_strategy="not majority", random_state=27)

    input_ids_resampled, labels_resampled = sm.fit_resample(input_ids, labels)
    input_masks_resampled, _ = sm.fit_resample(input_masks, labels)

    return input_ids_resampled, input_masks_resampled, labels_resampled

Is this acceptable? Is there a better way to do this?


Solution

  • I just want to clarify that this is the wrong way to apply SMOTE to input_ids. SMOTE creates synthetic samples by interpolating between existing ones, and interpolating token IDs produces meaningless tokens. Instead, take the embedding corresponding to the CLS token: run each text through BERT (without fine-tuning), extract the CLS embedding, apply SMOTE to those embeddings, and then pass the resampled embeddings to a classifier (any classifier).