python machine-learning scikit-learn classification ensemble-learning

StackingClassifier with base models trained on feature subsets


I can best describe my goal using a synthetic dataset. Suppose I have the following:

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, n_classes=3,
                           n_informative=3)

df = pd.DataFrame(X, columns=list('ABCDEFGHIJ'))

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, random_state=42)

X_train.head()
            A         B         C         D         E         F         G         H         I         J
541 -0.277848  1.022357 -0.950125 -2.100213  0.883638  0.821387  1.154613  0.075376  1.176242 -0.470087
440  1.089665  0.841446 -1.701004 -1.036256 -1.229357  0.345068  1.876470 -0.750067  0.080685 -1.318271
482  0.016010  0.025488 -1.189296 -1.052935 -0.623029  0.669521  1.518927  0.690019 -0.045486 -0.494186
422 -0.133358 -2.162190  1.170989 -0.942150  1.933444 -0.551180 -0.059908 -0.938672 -0.924097 -0.796185
778  0.901954  1.479360 -2.639176 -2.588845 -0.753915 -1.650621  2.727146  0.075260  1.330432 -0.941594

After conducting a feature importance analysis, I discovered that each of the 3 classes in the dataset is best predicted using a subset of the features, as opposed to the whole set. For example:

class  | optimal predictors
-------+-------------------
   0   |  A, B, C
   1   |  D, E, F, G
   2   |  G, H, I, J
-------+-------------------
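
In code, I would capture this mapping as a plain dict (the variable name is just illustrative):

feature_subsets = {
    0: ['A', 'B', 'C'],
    1: ['D', 'E', 'F', 'G'],
    2: ['G', 'H', 'I', 'J']
}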

At this point, I would like to use 3 one-vs-rest classifiers as the base models, training one sub-model per class on that class's best predictors, and then a StackingClassifier for the final prediction.

I have a high-level understanding of the StackingClassifier: different base models can be trained (e.g. DT, SVC, KNN, etc.), and then a meta-classifier, e.g. Logistic Regression, combines their predictions.
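
For reference, this is the kind of plain stacking setup I mean (a minimal sketch with arbitrary base models, reusing X_train and y_train from above):

from sklearn.ensemble import StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Plain stacking: every base model sees all 10 features
plain_stack = StackingClassifier(
    estimators=[
        ('dt', DecisionTreeClassifier(random_state=42)),
        ('svc', SVC(random_state=42)),
        ('knn', KNeighborsClassifier())
    ],
    final_estimator=LogisticRegression(max_iter=1000)
)
plain_stack.fit(X_train, y_train)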

In this case, however, each base model is the same DT classifier; the difference is that each one is to be trained on the feature subset best suited to its class, as above.

Then, finally, I want to make predictions on X_test.

I am not sure how this can be done, so I have described my goal using the synthetic data above.

How should I design this so that the base models are trained on their feature subsets and combined into a final prediction?


Solution

  • You can programmatically do what you describe, but I am not sure what the gain would be over a simple Random Forest, which internally does all of this (feature sub-selection, fitting, etc.); a quick comparison sketch follows the code below.

    Here is an implementation of what you have described. I have used exactly the same base and stacking models as the ones you mentioned:

    import numpy as np
    import pandas as pd
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import StackingClassifier
    from sklearn.base import clone
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import FunctionTransformer
    
    def select_columns(X, columns):
        # Keep only the requested columns (X is expected to be a DataFrame)
        return X[columns]
    
    
    X, y = make_classification(n_samples=1000, n_features=10, n_classes=3, n_informative=3)
    df = pd.DataFrame(X, columns=list('ABCDEFGHIJ'))
    X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.3, random_state=42)
    
    
    feature_subsets = {
        0: ['A', 'B', 'C'],
        1: ['D', 'E', 'F', 'G'],
        2: ['G', 'H', 'I', 'J']
    }
    
    # Base model
    base_dt_model = DecisionTreeClassifier(random_state=42)
    
    # One pipeline per class, each restricted to that class's best features.
    # Note: each tree is still fitted as a regular multiclass classifier,
    # only on the feature subset chosen for its class.
    classifiers = []
    for class_label, features in feature_subsets.items():
        # Fresh, unfitted copy of the base model for each class
        model = clone(base_dt_model)

        # Select the class's features, then apply the cloned model
        pipeline = Pipeline([
            ('feature_selection', FunctionTransformer(select_columns, kw_args={'columns': features})),
            ('classifier', model)
        ])

        classifiers.append(('dt_class_' + str(class_label), pipeline))
    
    # Logistic Regression as the meta-classifier
    stack = StackingClassifier(estimators=classifiers, final_estimator=LogisticRegression())
    
    # Fitting the stack cross-validates each pipeline to build the meta-features
    stack.fit(X_train, y_train)
    
    y_pred = stack.predict(X_test)
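
    To check whether the per-class feature subsets actually help, you can score the stack against the plain Random Forest baseline mentioned above. A minimal sketch (accuracy is just one possible metric here):

    from sklearn.metrics import accuracy_score
    from sklearn.ensemble import RandomForestClassifier

    print('stacking accuracy:     ', accuracy_score(y_test, y_pred))

    # Baseline: a single Random Forest trained on all 10 features
    rf = RandomForestClassifier(random_state=42)
    rf.fit(X_train, y_train)
    print('random forest accuracy:', accuracy_score(y_test, rf.predict(X_test)))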