I'm using `AdaBoostClassifier` with a weak learner (`DecisionTreeClassifier`) to classify a dataset. The dataset has 7857 samples:

```python
X.shape
# Output: (7857, 5)
y.shape
# Output: (7857,)
```
Here's the code for splitting the dataset and training the model:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=28
)

weak_learner = DecisionTreeClassifier(max_depth=1)
adb = AdaBoostClassifier(estimator=weak_learner, n_estimators=50, random_state=42)
adb_model = adb.fit(X_train, y_train)

y_pred = adb_model.predict(X_test)
print(classification_report(y_test, y_pred))
```
When I run this code with `test_size=0.25`, the classification metrics come out at 100% for every class:

```
              precision    recall  f1-score   support

       Cheap       1.00      1.00      1.00       496
   Expensive       1.00      1.00      1.00       506
  Reasonable       1.00      1.00      1.00       963

    accuracy                           1.00      1965
   macro avg       1.00      1.00      1.00      1965
weighted avg       1.00      1.00      1.00      1965
```
This cannot be true, as my data points are not perfectly separable (I checked with a plot).
However, when I change `test_size` to any other value (e.g., `0.3`, `0.2`), I get the following error:

```
ValueError: Found input variables with inconsistent numbers of samples
```
**What I've checked:**

- `X` and `y` have the same number of samples.
- There are no missing values in `X` or `y`.
**Questions:**

1. Why does `test_size=0.25` produce perfect metrics, but other `test_size` values result in an error?
2. How can I use other `test_size` values without the error?

---

The `test_size=0.25` setting doesn't really have an impact on the metrics, I don't think. Your model is just too good, probably because the function that the label follows is very simple, so your AdaBoost doesn't need more than a `DecisionTreeClassifier(max_depth=1)` to learn it.
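A quick way to sanity-check this claim (a minimal sketch, reusing the `X_train`/`X_test` split and `adb_model` from the question) is to fit the depth-1 stump on its own and also cross-validate the boosted model, so you can see the perfect score isn't an artifact of one particular split:

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Fit the depth-1 stump by itself, without boosting
stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)
print("stump accuracy:   ", stump.score(X_test, y_test))
print("AdaBoost accuracy:", adb_model.score(X_test, y_test))

# Cross-validate the boosted model over 5 different splits;
# near-1.0 scores across all folds mean the labels really do
# follow a rule simple enough for boosted stumps to learn
print("CV scores:", cross_val_score(adb, X, y, cv=5))
```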
However, you're probably reusing the same name `y_pred` throughout your code, so `y_pred` needs to be refreshed whenever you change the model or the test dataset. `test_size` controls the length of the test dataset, and your existing `y_pred` was computed when that dataset held 25% of the samples. If you modify `test_size`, you modify the test dataset, so you need to recompute `y_pred` with `adb.predict(X_test)`, both to get new predictions matching the new data points and to avoid the sample-count mismatch error.
To fix the issue, you just need to add the following before calling any function that uses `y_test` and `y_pred`:

```python
y_pred = adb.predict(X_test)
```
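Putting it together, here's a minimal sketch (assuming the same `X`, `y`, and `adb` setup as above) that re-runs split, fit, and predict in order for each `test_size`, so `y_test` and `y_pred` always have matching lengths:

```python
for test_size in (0.2, 0.25, 0.3):
    # Re-split: this changes the length of X_test and y_test
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=28
    )

    # Re-fit on the new training split
    adb_model = adb.fit(X_train, y_train)

    # Re-predict so y_pred has the same length as the new y_test
    y_pred = adb_model.predict(X_test)

    print(f"test_size={test_size}: "
          f"len(y_test)={len(y_test)}, len(y_pred)={len(y_pred)}")
    print(classification_report(y_test, y_pred))
```

As long as every step runs in this order after each change, the error disappears, because `classification_report` always receives a `y_pred` computed from the current `X_test`.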