python, text-classification, semisupervised-learning

Pseudo-Labelling for Text Classification in Python


I'm not good at machine learning. Can someone show me how to do text classification with pseudo-labelling in Python? I've never found the right implementation; I have searched everywhere on the internet but came up with nothing. I only found implementations for numeric datasets, none for text classification (vectorized text). So I wrote the code below, but I don't know whether it is correct. Am I doing it wrong? Please help me, I really need your help.

This is my dataset if you want to try it. I want to predict 'Label' from 'Content'.

My steps are:

  1. Split the data: 0.75 unlabeled, 0.25 labeled
  2. From the 0.25 labeled data, split again: 0.75 train labeled and 0.25 test labeled (a sketch of these two splits follows this list)
  3. Build a vectorizer for the train, test, and unlabeled datasets
  4. Build the first model from the labeled train data, then label the unlabeled data
  5. Concatenate the labeled train data with the unlabeled predictions that have probability > 0.99 (the pseudo-labeled data), and build the second model
  6. Remove the pseudo-labeled instances from the unlabeled data
  7. Predict the remaining unlabeled data with the second model, then repeat from step 3 until no predicted pseudo-label has probability > 0.99
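
The splits in steps 1 and 2 are not shown in my code below, so here is a minimal sketch of how they could look, assuming a DataFrame df with 'Content' and 'Label' columns (df, X_labeled, and y_labeled are illustrative names, not from my actual code):

from sklearn.model_selection import train_test_split

# Step 1: hide the labels of 75% of the rows (the "unlabeled" pool)
# and keep the remaining 25% as labeled data
X_labeled, X_unlabeled, y_labeled, _ = train_test_split(
    df['Content'], df['Label'], test_size=0.75, random_state=42)

# Step 2: split the labeled 25% into 75% train / 25% test
X_train, X_test, y_train, y_test = train_test_split(
    X_labeled, y_labeled, test_size=0.25, random_state=42)

This produces the X_train, X_test, X_unlabeled, y_train, and y_test that the loop below uses.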

This is my code:

Performing pseudo labelling on text classification

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB

# Initiate iteration counter
iterations = 0

# Containers to hold f1_scores and # of pseudo-labels
train_f1s = []
test_f1s = []
pseudo_labels = []

# Assign value to initiate while loop
high_prob = [1] 

# Loop will run until there are no more high-probability pseudo-labels
while len(high_prob) > 0:
    
    # Set up the TF-IDF vectorizer and fit it once per iteration,
    # on the current (growing) training data only
    columnTransformer = ColumnTransformer([
        ('tfidf', TfidfVectorizer(stop_words=None, max_features=100000),
         'Content')
    ], remainder='drop')
    columnTransformer.fit(pd.DataFrame({'Content': X_train}))

    def transforms(series):
        # Transform a text Series with the vectorizer fitted on X_train
        return columnTransformer.transform(pd.DataFrame({'Content': series}))

    X_train_df = transforms(X_train)
    X_test_df = transforms(X_test)
    X_unlabeled_df = transforms(X_unlabeled)
    
    # Fit classifier and make train/test predictions
    nb = MultinomialNB()
    nb.fit(X_train_df, y_train)
    y_hat_train = nb.predict(X_train_df)
    y_hat_test = nb.predict(X_test_df)

    # Calculate and print iteration # and f1 scores, and store f1 scores
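    # (note: f1_score's default average='binary' assumes 0/1 labels; for
    # string or multiclass labels, pass pos_label=... or average='macro')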
    train_f1 = f1_score(y_train, y_hat_train)
    test_f1 = f1_score(y_test, y_hat_test)
    print(f"Iteration {iterations}")
    print(f"Train f1: {train_f1}")
    print(f"Test f1: {test_f1}")
    train_f1s.append(train_f1)
    test_f1s.append(test_f1)
   
    # Generate predictions and probabilities for unlabeled data
    print(f"Now predicting labels for unlabeled data...")

    pred_probs = nb.predict_proba(X_unlabeled_df)
    preds = nb.predict(X_unlabeled_df)
    prob_0 = pred_probs[:,0]
    prob_1 = pred_probs[:,1]

    # Store predictions and probabilities in dataframe
    df_pred_prob = pd.DataFrame([])
    df_pred_prob['preds'] = preds
    df_pred_prob['prob_0'] = prob_0
    df_pred_prob['prob_1'] = prob_1
    df_pred_prob.index = X_unlabeled.index
    
    # Separate predictions with > 99% probability
    high_prob = pd.concat([df_pred_prob.loc[df_pred_prob['prob_0'] > 0.99],
                           df_pred_prob.loc[df_pred_prob['prob_1'] > 0.99]],
                          axis=0)
    
    print(f"{len(high_prob)} high-probability predictions added to training data.")
    
    pseudo_labels.append(len(high_prob))

    # Add pseudo-labeled data to training data
    X_train = pd.concat([X_train, X_unlabeled.loc[high_prob.index]], axis=0)
    y_train = pd.concat([y_train, high_prob.preds])      
    
    # Drop pseudo-labeled instances from unlabeled data
    X_unlabeled = X_unlabeled.drop(index=high_prob.index)
    
    print(f"{len(X_unlabeled)} unlabeled instances remaining.\n")
    
    # Update iteration counter
    iterations += 1

I think I'm doing something wrong, because the f1 scores keep decreasing with each iteration. Please help me, I'm stressed: f1 scores image

=================EDIT=================

So I've searched the literature, and I think I had misunderstood the concept of data splitting in pseudo-labelling.

I initially thought that the process starts by splitting the data into labeled and unlabeled sets, and that the labeled data is then split into train and test sets.

But after more searching, I found in this journal that my steps were incorrect. The journal says that pseudo-labelling should start by splitting the data into train and test sets first, and then splitting that train set into labeled and unlabeled datasets.

According to the journal, the best result is reached when splitting the data into 90% train and 10% test sets, and then splitting that 90% train set into 20% labeled and 80% unlabeled data. The journal tried confidence thresholds ranging from 0.7 to 0.9 as the boundary for accepting pseudo-labels, and with that split proportion the best threshold was 0.74. So I fixed my steps with the new proportions and the 0.74 threshold, and my F1 scores finally increase. Here are my steps:

  1. Split the data: 0.9 train, 0.1 test sets (I keep the labels of the test set, so I can measure the f1 scores)
  2. From the 0.9 train set, split again: 0.2 labeled and 0.8 unlabeled data
  3. Build a vectorizer for the X values of the labeled train, test, and unlabeled training datasets
  4. Build the first model from the labeled train data, then label the unlabeled training data. Measure the F1 score against the (already labeled) test set.
  5. Concatenate the labeled train data with the unlabeled predictions that have probability > 0.74 (threshold based on the journal). We call this new data pseudo-labelled, treat it like the actual labels, and build the second model from the new train set.
  6. Remove the selected pseudo-labelled instances from the unlabeled data
  7. Use the second model to predict the remaining unlabeled data, then repeat from step 3 until no predicted pseudo-label has probability > 0.74
  8. The last model is the final one.

My code is still the same; I only changed the split proportions and the threshold (a sketch of the changed parts follows), and my f1 scores finally increase through 4 iterations: my new f1 scores
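
For reference, a minimal sketch of the changed parts, again assuming a DataFrame df with 'Content' and 'Label' columns (variable names are illustrative):

import pandas as pd
from sklearn.model_selection import train_test_split

# Step 1: 90% train / 10% test (the test set keeps its labels)
X_train_all, X_test, y_train_all, y_test = train_test_split(
    df['Content'], df['Label'], test_size=0.1, random_state=42)

# Step 2: from the 90% train set, 20% labeled / 80% unlabeled
X_train, X_unlabeled, y_train, _ = train_test_split(
    X_train_all, y_train_all, test_size=0.8, random_state=42)

# Inside the loop, the acceptance threshold in the high_prob filter
# drops from 0.99 to 0.74:
high_prob = pd.concat([df_pred_prob.loc[df_pred_prob['prob_0'] > 0.74],
                       df_pred_prob.loc[df_pred_prob['prob_1'] > 0.74]],
                      axis=0)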

Am I doing it right now? Thank you all for your attention. Thank you so much.


Solution

  • I'm not good at machine learning.

    Overall I would say that you are quite good at machine learning: semi-supervised learning is an advanced type of problem, and I think your solution is quite good. At least the general principle seems correct, but it's difficult to say for sure (I don't have time to analyze the code in detail, sorry). A few comments:

    It's pretty good anyway; my comments are just ideas if you want to improve further. Note that it's been a long time since I've worked with semi-supervised learning, so I'm not sure I remember everything very well ;)