machine-learning split data-science

Stratified vs Random Splitting on Highly Categorical Datasets


I am building a machine learning model on a highly categorical survey dataset. Each of the 12 features has been bucketized carefully, based on the results and domain intuition.

I am spending time on the dataset splitting decision. I think stratified splitting would be good for ensuring that the categorical variables are distributed equally across the splits. However, the number of features is high, so I'd like some input on whether stratified splitting is best or whether a normal train/test split would do.

I haven't tried anything yet. I am doing some research and circling back to the stratified sampling concepts in ML books.


Solution
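
One way to compare the two options is to run both splits on the data and look at the resulting proportions. The following is a minimal sketch rather than a definitive recipe. It assumes scikit-learn's `train_test_split`, and the data is synthetic: the two bucketized features (`age_bucket`, `income_bucket`), the imbalanced binary target and the `share` helper are hypothetical stand-ins for the survey data, not taken from the question. It stratifies on the target only. Stratifying jointly on all 12 features would create many small strata, and scikit-learn raises an error when a stratum has fewer than two members.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

# Hypothetical survey data: two bucketized features and an imbalanced target.
age_bucket = rng.choice(["18-29", "30-49", "50+"], size=n, p=[0.5, 0.3, 0.2])
income_bucket = rng.choice(["low", "mid", "high"], size=n, p=[0.6, 0.3, 0.1])
X = np.column_stack([age_bucket, income_bucket])
y = rng.choice([0, 1], size=n, p=[0.9, 0.1])

# Option 1: plain random split.
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Option 2: split stratified on the target, so the class proportions
# in train and test match the full dataset as closely as possible.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)


def share(values, category):
    """Fraction of entries in `values` equal to `category` (hypothetical helper)."""
    return float(np.mean(values == category))


print("positive rate, full data:      ", y.mean())
print("positive rate, random test:    ", y_test_r.mean())
print("positive rate, stratified test:", y_test_s.mean())

# Check how a rare feature category is represented in each test set.
print("share of 'high' income, full data:      ", share(X[:, 1], "high"))
print("share of 'high' income, random test:    ", share(X_test_r[:, 1], "high"))
print("share of 'high' income, stratified test:", share(X_test_s[:, 1], "high"))
```

If the random split already reproduces the target rate and the rare feature categories closely, a normal train/test split is probably sufficient. If a rare class or category is noticeably under-represented in the test set, stratifying on the target (or on one or two key features) is the safer choice.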