I am working on a multi-label text classification problem with 90 target labels. The data (~100k records) has a long-tailed, imbalanced label distribution. I am using the one-against-all (OAA/OvR) strategy and trying to build an ensemble using stacking.
Text features: HashingVectorizer (n_features=2**20, char analyzer), followed by TruncatedSVD (n_components=200) to reduce the dimensionality.
text_pipeline = Pipeline([
    ('hashing_vectorizer', HashingVectorizer(n_features=2**20,
                                             analyzer='char')),
    ('svd', TruncatedSVD(algorithm='randomized',
                         n_components=200, random_state=19204))
])

feat_pipeline = FeatureUnion([('text', text_pipeline)])

estimators_list = [
    ('ExtraTrees', OneVsRestClassifier(
        ExtraTreesClassifier(n_estimators=30,
                             class_weight='balanced',
                             random_state=4621))),
    ('linearSVC', OneVsRestClassifier(LinearSVC(class_weight='balanced')))
]

estimators_ensemble = StackingClassifier(
    estimators=estimators_list,
    final_estimator=OneVsRestClassifier(
        LogisticRegression(solver='lbfgs', max_iter=300)))

classifier_pipeline = Pipeline([
    ('features', feat_pipeline),
    ('clf', estimators_ensemble)
])
Error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-41-ad4e769a0a78> in <module>()
1 start = time.time()
----> 2 classifier_pipeline.fit(X_train.values, y_train_encoded)
3 print(f"Execution time {time.time()-start}")
4
3 frames
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
795 return np.ravel(y)
796
--> 797 raise ValueError("bad input shape {0}".format(shape))
798
799
ValueError: bad input shape (89792, 83)
StackingClassifier does not support multi-label classification as of now. You can see why from the traceback: fit validates the target with column_or_1d, which expects a 1-D y, so the 2-D label-indicator matrix triggers the bad input shape (89792, 83) error.
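To illustrate, here is a minimal, hypothetical reproduction (not from the original post): fitting a StackingClassifier directly on a 2-D multi-label indicator matrix fails during target validation, before any estimator is trained.

```python
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(40, 5)
Y = rng.randint(0, 2, size=(40, 3))  # 2-D multi-label indicator target

stack = StackingClassifier(
    estimators=[('tree', DecisionTreeClassifier(random_state=0))],
    final_estimator=LogisticRegression())

try:
    stack.fit(X, Y)  # y is validated as a 1-D column -> ValueError
except ValueError as exc:
    print(type(exc).__name__)  # ValueError (exact message varies by version)
```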
The solution is to put the OneVsRestClassifier wrapper on top of the StackingClassifier rather than on the individual models.
Example:
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import ExtraTreesClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = make_multilabel_classification(n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

estimators_list = [
    ('ExtraTrees', ExtraTreesClassifier(n_estimators=30,
                                        class_weight='balanced',
                                        random_state=4621)),
    ('linearSVC', LinearSVC(class_weight='balanced'))
]

estimators_ensemble = StackingClassifier(
    estimators=estimators_list,
    final_estimator=LogisticRegression(solver='lbfgs', max_iter=300))

ovr_model = OneVsRestClassifier(estimators_ensemble)
ovr_model.fit(X_train, y_train)
ovr_model.score(X_test, y_test)
# 0.45454545454545453
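A caveat on the score above: for multi-label targets, score reports subset accuracy — a sample only counts if every one of its labels is predicted correctly — which is a harsh metric under heavy imbalance. A per-label metric such as micro- or macro-averaged F1 is usually more informative. A self-contained sketch on the same toy data:

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import ExtraTreesClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = make_multilabel_classification(n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

stack = StackingClassifier(
    estimators=[('ExtraTrees', ExtraTreesClassifier(n_estimators=30,
                                                    class_weight='balanced',
                                                    random_state=4621)),
                ('linearSVC', LinearSVC(class_weight='balanced'))],
    final_estimator=LogisticRegression(solver='lbfgs', max_iter=300))

ovr_model = OneVsRestClassifier(stack).fit(X_train, y_train)
y_pred = ovr_model.predict(X_test)

# Micro-F1 pools TP/FP/FN over all labels; macro-F1 averages the
# per-label F1 scores, so rare labels weigh as much as common ones.
print(f1_score(y_test, y_pred, average='micro'))
print(f1_score(y_test, y_pred, average='macro'))
```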
from sklearn.metrics import confusion_matrix

confusion_matrix(
    y_train[:, 0],
    ovr_model.estimators_[0].estimators_[0].predict(X_train))
# array([[818,   0],
#        [  0, 522]])
ovr_model.estimators_[0].estimators_[0].feature_importances_
#array([0.05049793, 0.07232525, 0.05278524, 0.08005984, 0.05036507,
# 0.03674032, 0.06144285, 0.03473714, 0.04080104, 0.05120309,
# 0.05311589, 0.04119592, 0.03239608, 0.08101098, 0.03522335,
# 0.03676684, 0.04613645, 0.04755277, 0.05268342, 0.04296053])
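One practical caveat for the original problem size — this is just back-of-envelope arithmetic, not from the answer above: wrapping the stack in OneVsRestClassifier fits one full stacking ensemble per label, so with 90 labels, 2 base estimators, and StackingClassifier's default 5-fold internal CV, the number of model fits grows quickly.

```python
# Rough fit count: each base estimator is trained cv times for the
# out-of-fold predictions plus once on the full data, and the
# final_estimator is trained once -- all repeated per label.
n_labels, n_base, cv = 90, 2, 5          # assumed sizes from the question
fits_per_label = n_base * (cv + 1) + 1   # base CV fits + refits + meta-model
total_fits = n_labels * fits_per_label
print(total_fits)  # 1170
```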