Tags: machine-learning, scikit-learn, nlp, multilabel-classification, ensemble-learning

Unable to do Stacking for a Multi-label classifier


I am working on a multi-label text classification problem with 90 target labels and around 100k records. The label distribution is long-tailed and imbalanced. I am using the one-vs-rest (one-against-all, OAA) strategy and am trying to build an ensemble with stacking.

Text features: HashingVectorizer (2**20 features, character analyzer), followed by TruncatedSVD to reduce the dimensionality (n_components=200).

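from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import ExtraTreesClassifier, StackingClassifier
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

text_pipeline = Pipeline([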
    ('hashing_vectorizer', HashingVectorizer(n_features=2**20,
                                             analyzer='char')),
    ('svd', TruncatedSVD(algorithm='randomized',
                         n_components=200, random_state=19204))])

feat_pipeline = FeatureUnion([('text', text_pipeline)])

estimators_list = [('ExtraTrees',
                    OneVsRestClassifier(ExtraTreesClassifier(n_estimators=30,
                                                             class_weight="balanced",
                                                             random_state=4621))),
                   ('linearSVC',
                    OneVsRestClassifier(LinearSVC(class_weight='balanced')))]
estimators_ensemble = StackingClassifier(estimators=estimators_list,
                                         final_estimator=OneVsRestClassifier(
                                             LogisticRegression(solver='lbfgs',
                                                                max_iter=300)))

classifier_pipeline = Pipeline([
    ('features', feat_pipeline),
    ('clf', estimators_ensemble)])

Error

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-41-ad4e769a0a78> in <module>()
      1 start = time.time()
----> 2 classifier_pipeline.fit(X_train.values, y_train_encoded)
      3 print(f"Execution time {time.time()-start}")
      4 

3 frames
/usr/local/lib/python3.6/dist-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
    795         return np.ravel(y)
    796 
--> 797     raise ValueError("bad input shape {0}".format(shape))
    798 
    799 

ValueError: bad input shape (89792, 83)
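
The traceback ends in `column_or_1d`, the scikit-learn validation helper that rejects a 2-D target. As a minimal sketch, the same check can be reproduced in isolation. The array `y_multilabel` is a made-up stand-in for the encoded labels, and the exact error message may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.utils.validation import column_or_1d

# Hypothetical stand-in for y_train_encoded: 5 samples, 3 labels.
y_multilabel = np.zeros((5, 3), dtype=int)

try:
    column_or_1d(y_multilabel)  # 2-D indicator matrix is rejected
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# A single label column is 1-D and passes the same check.
y_single = column_or_1d(y_multilabel[:, 0])
print(y_single.shape)
```

Any estimator that validates `y` this way accepts one label column at a time, but not the whole indicator matrix.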

Solution

  • StackingClassifier does not support multi-label classification as of now. You can check this kind of limitation by looking at the shape documented for the fit parameters: here, y is expected to have shape (n_samples,), whereas your target has shape (89792, 83).

    The solution is to put the OneVsRestClassifier wrapper around the StackingClassifier rather than around the individual models.

    Example:

    from sklearn.datasets import make_multilabel_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.svm import LinearSVC
    from sklearn.ensemble import StackingClassifier
    from sklearn.multiclass import OneVsRestClassifier
    
    X, y = make_multilabel_classification(n_classes=3, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                        test_size=0.33,
                                                        random_state=42)
    
    estimators_list = [('ExtraTrees', ExtraTreesClassifier(n_estimators=30, 
                                                           class_weight="balanced", 
                                                           random_state=4621)),
                       ('linearSVC', LinearSVC(class_weight='balanced'))]
    
    estimators_ensemble = StackingClassifier(
        estimators=estimators_list,
        final_estimator=LogisticRegression(solver='lbfgs', max_iter=300))
    
    ovr_model = OneVsRestClassifier(estimators_ensemble)
    
    ovr_model.fit(X_train, y_train)
    ovr_model.score(X_test, y_test)
    
    # 0.45454545454545453  (subset accuracy: every label of a sample must match)
    
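    from sklearn.metrics import confusion_matrix

    # Compare label 0 with the predictions of the first base model
    # (the ExtraTreesClassifier fitted inside the stack for label 0).
    confusion_matrix(
        y_train[:, 0],
        ovr_model.estimators_[0].estimators_[0].predict(X_train))

    # Output as posted; the counts total 1340, so it appears to come from a run
    # with more samples than the 67 training rows created above.
    #array([[818,   0],
    #       [  0, 522]])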
    
    ovr_model.estimators_[0].estimators_[0].feature_importances_
    
    #array([0.05049793, 0.07232525, 0.05278524, 0.08005984, 0.05036507,
    #       0.03674032, 0.06144285, 0.03473714, 0.04080104, 0.05120309,
    #       0.05311589, 0.04119592, 0.03239608, 0.08101098, 0.03522335,
    #       0.03676684, 0.04613645, 0.04755277, 0.05268342, 0.04296053])
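
Applied to the text pipeline from the question, the same idea might look like the sketch below. The toy corpus, the topic strings and the reduced sizes (`n_features=2**12`, `n_components=5`) are assumptions made so that the example runs quickly; they are not the asker's data or settings. The FeatureUnion is omitted because it wraps only one pipeline.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import ExtraTreesClassifier, StackingClassifier
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: every document mentions one or two of three topics.
topics = {0: "football match goal league",
          1: "python code compiler bug",
          2: "recipe oven butter flour"}
combos = [(0,), (1,), (2,), (0, 1), (1, 2), (0, 2)]
texts, rows = [], []
for i in range(60):
    combo = combos[i % len(combos)]
    texts.append(" ".join(topics[t] for t in combo) + f" doc{i}")
    rows.append([int(t in combo) for t in range(3)])
y = np.array(rows)  # binary indicator matrix, shape (60, 3)

# Same feature steps as in the question, with smaller sizes for the toy data.
text_pipeline = Pipeline([
    ("hashing_vectorizer", HashingVectorizer(n_features=2**12, analyzer="char")),
    ("svd", TruncatedSVD(algorithm="randomized", n_components=5,
                         random_state=19204)),
])

# Base models are plain binary classifiers; no OneVsRestClassifier inside.
stack = StackingClassifier(
    estimators=[
        ("ExtraTrees", ExtraTreesClassifier(n_estimators=30,
                                            class_weight="balanced",
                                            random_state=4621)),
        ("linearSVC", LinearSVC(class_weight="balanced")),
    ],
    final_estimator=LogisticRegression(solver="lbfgs", max_iter=300),
)

# OneVsRestClassifier goes on the outside: one stacked ensemble per label.
classifier_pipeline = Pipeline([
    ("features", text_pipeline),
    ("clf", OneVsRestClassifier(stack)),
])

classifier_pipeline.fit(texts, y)
pred = classifier_pipeline.predict(texts)
print(pred.shape)
```

Note that this fits one full stacked ensemble per label, so with 90 labels the training cost grows accordingly.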