machine-learning · scikit-learn · ensemble-learning · mlxtend

Does the number of classifiers in a stacking classifier have to equal the number of columns in my training/testing dataset?


I'm trying to solve a binary classification task. The training dataset contains 9 features, and after my feature engineering I ended up with 14 features. I want to use a stacking approach with mlxtend.classifier.StackingCVClassifier, combining 4 different classifiers, but when trying to predict the test dataset I get the error: ValueError: query data dimension must match training data dimension

%%time
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.classifier import StackingCVClassifier
from xgboost import XGBClassifier

models = [KNeighborsClassifier(weights='distance'), GaussianNB(),
          SGDClassifier(loss='hinge'), XGBClassifier()]
calibrated_models = Calibrated_classifier(models, return_names=False)
meta = LogisticRegression()
stacker = StackingCVClassifier(classifiers=calibrated_models, meta_classifier=meta,
                               use_probas=True).fit(X.values, y.values)

Remark: in my code I just wrote a function that returns a list of calibrated classifiers to pass to StackingCVClassifier; I have checked that it is not causing the error.

Remark 2: I had already tried to build a stacker from scratch, with the same result, so I thought something was wrong with my own stacker.

import pandas as pd
from sklearn.linear_model import LogisticRegression

def StackingClassifier(X, y, models, stacker=LogisticRegression(), return_data=True):
    # extract each model's class name, e.g. 'KNeighborsClassifier'
    names = [str(model)[:str(model).find('(')] for model in models]

    # level-0: one column of positive-class probabilities per base model
    predictions = pd.DataFrame()
    for i, model in enumerate(models):
        model.fit(X, y)
        predictions[names[i]] = model.predict_proba(X)[:, 1]

    if return_data:
        return predictions
    else:
        # level-1: fit the meta-classifier on the base models' predictions
        return stacker.fit(predictions, y)
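For context, here is a minimal sketch (on scikit-learn's binary breast-cancer dataset, not my data; all names are illustrative) of the step a from-scratch stacker also needs at predict time: the same fitted base models must transform the test set into meta-features of the same width the meta-classifier was trained on.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = [KNeighborsClassifier(), GaussianNB()]

def meta_features(fitted_models, X):
    # one column of positive-class probabilities per fitted base model
    return pd.DataFrame({type(m).__name__: m.predict_proba(X)[:, 1]
                         for m in fitted_models})

# fit the base models once on the training data ...
for m in models:
    m.fit(X_train, y_train)

# ... so the *same fitted models* can build meta-features for train and test
stacker = LogisticRegression().fit(meta_features(models, X_train), y_train)
preds = stacker.predict(meta_features(models, X_test))
```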

Could you please help me understand the correct usage of a stacking classifier?


EDIT: This is my code for the calibrated classifiers. The function takes a list of n classifiers, applies sklearn's CalibratedClassifierCV to each one, and returns a list of n calibrated classifiers. There is an option to return them as a zip of (name, classifier) pairs, since the function is mainly intended to be used with sklearn's VotingClassifier.

from sklearn.calibration import CalibratedClassifierCV

def Calibrated_classifier(models, method='sigmoid', return_names=True):
    # extract each model's class name, e.g. 'KNeighborsClassifier'
    names = [str(model)[:str(model).find('(')] for model in models]

    # wrap every base model in a cross-validated probability calibrator
    calibrated = [CalibratedClassifierCV(base_estimator=model, method=method)
                  for model in models]

    if return_names:
        # (name, classifier) pairs, e.g. for sklearn's VotingClassifier
        return zip(names, calibrated)
    else:
        return calibrated
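To illustrate the zipped variant, a minimal usage sketch (on the Iris dataset; the model choice here is just an example) passing the (name, classifier) pairs to sklearn's VotingClassifier might look like:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

models = [KNeighborsClassifier(), GaussianNB()]
# build the (name, calibrated classifier) pairs VotingClassifier expects
estimators = [(type(m).__name__, CalibratedClassifierCV(m)) for m in models]

# soft voting averages the calibrated predict_proba outputs
voter = VotingClassifier(estimators=estimators, voting='soft').fit(X, y)
print(voter.predict(X[:1]))
```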

Solution

  • I have tried your code with the Iris dataset. It works fine, so I think the problem is with the dimension of your test data, not with the calibration.

    from sklearn.linear_model import LogisticRegression, SGDClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from mlxtend.classifier import StackingCVClassifier
    from sklearn import datasets

    X, y = datasets.load_iris(return_X_y=True)

    models = [KNeighborsClassifier(weights='distance'),
              SGDClassifier(loss='hinge')]
    calibrated_models = Calibrated_classifier(models, return_names=False)
    meta = LogisticRegression(multi_class='ovr')
    stacker = StackingCVClassifier(classifiers=calibrated_models,
                                   meta_classifier=meta, use_probas=True,
                                   cv=3).fit(X, y)
    

    Prediction

    stacker.predict([X[0]])
    #array([0])
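To answer the title question directly: no. The meta-classifier sees one probability column per base classifier, so its input width is set by the number of classifiers, not by the dataset's columns; it is the base classifiers that require the test set to have the same number of columns as the training set. A minimal sketch (Iris again; not your pipeline) showing both facts:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features
base_models = [KNeighborsClassifier(), GaussianNB()]

# level-0: each base model contributes one probability column, so the
# meta-feature matrix has as many columns as base models (2 here),
# independent of the 4 original features
meta_features = np.column_stack(
    [m.fit(X, y).predict_proba(X)[:, 1] for m in base_models])
meta = LogisticRegression().fit(meta_features, y)

# the ValueError comes from the base models: they were fit on 4 columns,
# so querying them with a different width fails
try:
    base_models[0].predict(X[:, :3])  # only 3 of the 4 training columns
except ValueError:
    print("dimension mismatch")
```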