I am working on a machine learning model for a survey dataset that is highly categorical: each of the 12 features has been carefully bucketized based on the survey results and domain intuition.
I am spending time on the dataset splitting decision. I think stratified splitting would be a good way to ensure the categorical variables are equally distributed across splits. However, since the number of features is high, I'd like some input on whether stratified splitting is best or whether a plain train/test split would do.
I haven't tried anything yet; I am doing some research and circling back to stratified sampling concepts in ML books.
Always use a stratified split if your target or important features have imbalanced categories; it preserves representativeness.
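To illustrate the point, here's a minimal sketch with a synthetic imbalanced target (the data and class names are made up): passing `stratify=y` to scikit-learn's `train_test_split` keeps the class proportions essentially identical in both splits, which a plain random split only does approximately.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: class "C" is only 5% of the data
rng = np.random.default_rng(0)
y = rng.choice(["A", "B", "C"], size=1000, p=[0.8, 0.15, 0.05])
X = rng.normal(size=(1000, 3))

# stratify=y preserves the class proportions in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

for label in ["A", "B", "C"]:
    print(label, round((y_tr == label).mean(), 3), round((y_te == label).mean(), 3))
```

With a rare class like the 5% one above, this is exactly where an unstratified split can leave the test set with too few (or zero) examples of that class.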
If you have multiple categorical columns, you can combine the key ones into a single stratification key, or use a multi-label stratification library such as skmultilearn.
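A quick sketch of the combined-key idea (the column names `region`, `age_bucket`, and `target` are hypothetical): concatenate the columns you most want balanced into one string per row and stratify on that. Keep the key to a few columns, since every extra column multiplies the number of strata and each stratum needs at least two rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy survey-like frame; columns are illustrative only
df = pd.DataFrame({
    "region": ["N", "S", "N", "S"] * 50,
    "age_bucket": ["18-30", "31-50"] * 100,
    "target": [0, 1] * 100,
})

# Combine the key columns into a single stratification label per row
strat_key = (
    df["region"] + "_" + df["age_bucket"] + "_" + df["target"].astype(str)
)

# Stratify on the combined key so each (region, age_bucket, target)
# combination is represented proportionally in train and test
train_df, test_df = train_test_split(
    df, test_size=0.25, stratify=strat_key, random_state=0
)
```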
For high-cardinality columns or many features, test both splits and compare the variance of your metrics; sometimes a random split is fine if the distribution drift between train and test is low.
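One simple way to check that drift, sketched below with a made-up `bucket` column: compare per-category shares between train and test under a plain random split. If the maximum absolute difference is small for all your features, stratification is unlikely to change your results much.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical categorical column with five buckets
rng = np.random.default_rng(1)
df = pd.DataFrame({"bucket": rng.choice(list("abcde"), size=2000)})

# Plain (unstratified) random split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=1)

# Max absolute difference in category shares between the two splits;
# a small value means the random split already preserved the distribution
p_tr = train_df["bucket"].value_counts(normalize=True)
p_te = test_df["bucket"].value_counts(normalize=True)
drift = (p_tr - p_te).abs().max()
print(f"max per-category drift: {drift:.3f}")
```

Running this check across all 12 bucketized features, and over a few random seeds, gives a concrete basis for choosing between the two splitting strategies instead of guessing.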