python machine-learning train-test-split imbalanced-data

How can I properly split an imbalanced dataset into train and test sets?


I have a flight delay dataset and am trying to split it into train and test sets before sampling. On-time cases make up about 80% of the data and delayed cases about 20%.

Normally in machine learning the train-to-test size ratio is 8:2. But my data is heavily imbalanced. In the extreme case, a random split could put mostly on-time cases in the train set and mostly delayed cases in the test set, and accuracy would be poor.

So my question is: how can I properly split an imbalanced dataset into train and test sets?


Solution

  • Changing the train/test ratio alone probably won't give you reliable predictions or results.

    If you are working with an imbalanced dataset, try a re-sampling technique to get better results. On imbalanced data a classifier tends to "predict" the most common class almost regardless of the features.

    Also use a different performance metric, such as the F1 score, since accuracy can be misleading on an imbalanced dataset.

    Please go through the link below; it should make this clearer.

    What is the correct procedure to split the Data sets for classification problem?

    Cleveland heart disease dataset - can’t describe the class
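
As a rough illustration of the advice above, here is a minimal sketch using scikit-learn. The synthetic data, the 3 features, and the logistic regression model are all assumptions standing in for the real flight delay data. The stratify argument is not mentioned in the answer; it is one common way to keep the 80:20 class ratio in both parts of the split, so the extreme case described in the question cannot happen. Re-sampling is applied to the training set only, and the model is scored with F1 instead of accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the flight data (an assumption): 1000 rows,
# 3 numeric features, label 1 = delayed (20%), label 0 = on-time (80%).
rng = np.random.default_rng(0)
y = np.array([1] * 200 + [0] * 800)
X = rng.normal(size=(1000, 3)) + y[:, None]  # delayed rows are shifted by +1

# Stratified 80/20 split: train and test both keep the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
train_delayed_share = y_train.mean()
test_delayed_share = y_test.mean()

# Oversample the minority class in the training set only; the test set
# stays untouched so it still reflects the real class distribution.
X_min, X_maj = X_train[y_train == 1], X_train[y_train == 0]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))

# Fit on the balanced training data, then score on the untouched test set.
model = LogisticRegression().fit(X_bal, y_bal)
test_f1 = f1_score(y_test, model.predict(X_test))  # F1 for the delayed class

print(f"delayed share: train={train_delayed_share:.2f}, test={test_delayed_share:.2f}")
print(f"F1 on the delayed class: {test_f1:.2f}")
```

Splitting before re-sampling matters here: if the minority rows were duplicated first, copies of the same row could end up in both train and test, which would make the test score look better than it really is.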