machine-learning text-classification oversampling

Oversampled train set and test set - machine learning classification


Let's say I oversampled my training set after splitting, then selected the features to extract based on an analysis of the training set.

After this, do I evaluate classification performance (accuracy, precision, F1 measure, etc.) on the oversampled training set together with the test set, or on the test set alone?


Solution

  • (Not really a programming question but it's important enough to be clarified imho)

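    To measure performance reliably you must use only the original test set, without any resampling. The oversampled training set should not be part of the evaluation, since the model has already seen that data.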

    This is one of the reasons why the train/test split should always be done first: the test set should be kept "fresh". Resampling the test set would be like cheating, because it makes the problem easier to solve.

    Note: in general resampling rarely works, especially with text.
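
To make the order of operations concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration and is not from the question: the toy spam/ham data, the helper names (`split_first`, `oversample`, `fit_keyword_model`, `predict`, `precision_recall_f1`) and the keyword "model" are all assumptions. It splits first, oversamples and selects features on the training set only, and computes the metrics on the untouched test set.

```python
import random
from collections import Counter

def split_first(rows, n_test, seed):
    """Shuffle and split BEFORE any resampling."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]  # train, test

def oversample(rows, seed):
    """Randomly duplicate rows until every class is as large as the biggest one."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[1], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

def fit_keyword_model(train_rows):
    """Toy feature selection: keep words seen more often in spam than in ham."""
    spam_counts, ham_counts = Counter(), Counter()
    for text, label in train_rows:
        (spam_counts if label == "spam" else ham_counts).update(text.split())
    return {word for word in spam_counts if spam_counts[word] > ham_counts[word]}

def predict(model, text):
    return "spam" if any(word in model for word in text.split()) else "ham"

def precision_recall_f1(y_true, y_pred, positive="spam"):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical imbalanced data set: 90 "ham" rows and 10 "spam" rows.
rows = [(f"meeting notes h{i}", "ham") for i in range(90)]
rows += [(f"win offer s{i}", "spam") for i in range(10)]

train, test = split_first(rows, n_test=20, seed=0)  # 1. split first
train_balanced = oversample(train, seed=0)          # 2. resample the training set only
model = fit_keyword_model(train_balanced)           # 3. select features on training data only

# 4. evaluate on the original test set, and nothing else
y_true = [label for _, label in test]
y_pred = [predict(model, text) for text, _ in test]
metrics = precision_recall_f1(y_true, y_pred)
print("precision, recall, F1 on the untouched test set:", metrics)
```

The test set keeps its original class imbalance, so the reported numbers reflect what the classifier would face on new data.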