I'm building a classification model for sleep disorders with three base models: Logistic Regression, Random Forest, and an SVM.
Now I want to combine these models in a Voting Ensemble with soft voting. I don't really have anyone to ask, so I came here to ask about the correct approach: I know LR and SVM need scaled features, but since all models inside the ensemble receive the same input (either scaled or raw), do I have to feed scaled data to the Random Forest too? I know RF doesn't need scaling, but the others do. What is the best approach here?
I tried both approaches: training the Random Forest on scaled data to match the other models, and manually combining predictions so that each model receives its preferred data format (sketched below). I expected the second approach to perform better, since Random Forest shouldn't need scaling, but I wanted to verify what the correct practice is when using VotingClassifier.
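For reference, my manual soft-voting attempt looked roughly like this (a simplified sketch; X_train_scaled and X_test_scaled are the StandardScaler-transformed versions of my train/test arrays):

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# each model is fit on the data format it prefers
lr = LogisticRegression().fit(X_train_scaled, y_train)
svc = SVC(probability=True).fit(X_train_scaled, y_train)
rf = RandomForestClassifier().fit(X_train, y_train)

# soft voting by hand: average the predicted class probabilities
proba = (lr.predict_proba(X_test_scaled)
         + svc.predict_proba(X_test_scaled)
         + rf.predict_proba(X_test)) / 3
y_pred = lr.classes_[np.argmax(proba, axis=1)]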
You don't have to scale the Random Forest: it's tree-based, so it works fine on raw features.
You do, however, need all three models to live inside the same VotingClassifier. The trick is to give each estimator its own preprocessing by wrapping it in a Pipeline. Something like this:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

# scale-sensitive models get their own StandardScaler step
pipe_lr = Pipeline([('scale', StandardScaler()),
                    ('lr', LogisticRegression())])
pipe_svc = Pipeline([('scale', StandardScaler()),
                     ('svc', SVC(probability=True))])  # probability=True is required for soft voting

# Random Forest is scale-invariant, so its pipeline has no scaler
pipe_rf = Pipeline([('rf', RandomForestClassifier())])

voting = VotingClassifier(
    estimators=[('lr', pipe_lr),
                ('svc', pipe_svc),
                ('rf', pipe_rf)],
    voting='soft'  # average the members' predicted class probabilities
)
voting.fit(X_train, y_train)
Note that the RF pipeline has no scaler: the VotingClassifier passes the same raw X to every member pipeline, and each pipeline then applies (or skips) scaling internally.
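If you want to confirm what soft voting is doing under the hood, you can check that the ensemble's probabilities equal the mean of the fitted members' probabilities (this assumes the fitted voting object from above and some test array X_test):

import numpy as np

# soft voting (with no weights) is the mean of the members' predict_proba outputs
avg_proba = np.mean([est.predict_proba(X_test) for est in voting.estimators_], axis=0)
assert np.allclose(avg_proba, voting.predict_proba(X_test))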
So, by wrapping each base learner in its own Pipeline, applying StandardScaler only to Logistic Regression and the SVM while leaving Random Forest on the original features, the VotingClassifier receives one consistent input and each model still trains under its ideal conditions. This keeps your code clean and avoids compromising any model's performance.
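One more benefit of this setup: because each StandardScaler lives inside its own pipeline, it is re-fit on the training portion of every fold during cross-validation, so there is no leakage from held-out data. A quick sanity check (assuming your full training arrays X_train, y_train):

from sklearn.model_selection import cross_val_score

# 5-fold CV over the entire ensemble; the scalers are re-fit within each fold
scores = cross_val_score(voting, X_train, y_train, cv=5)
print(f"Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")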