Tags: python, scikit-learn, pipeline, outliers, k-fold

Problems creating a transformer for a pipeline


Right now I'm trying to create a pipeline whose first step is random oversampling and whose second step is a custom outlier remover, but I'm having problems executing that pipeline.

This is my code for the pipeline and the whole process:

import numpy as np
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import make_pipeline

accuracy_lst = []
precision_lst = []
recall_lst = []
f1_lst = []
auc_lst = []

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
log_reg_params = {"penalty": ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
rand_log_reg = RandomizedSearchCV(LogisticRegression(max_iter=200), log_reg_params, n_iter=4)

for train, test in kf.split(Org_X_train, Org_y_train):
    X_train, X_test = Org_X_train.iloc[train], Org_X_train.iloc[test]
    y_train, y_test = Org_y_train.iloc[train], Org_y_train.iloc[test]
    print(X_train.index)
    pipeline = make_pipeline(RandomOverSampler(random_state=42), OutlierRemover(columns=['V14', 'V12', 'V10', 'V4', 'V11', 'V2']), rand_log_reg)

    print(X_train.index)
    model = pipeline.fit(X_train, y_train)
    best_est = rand_log_reg.best_estimator_
    prediction = best_est.predict(X_test)

    accuracy_lst.append(accuracy_score(y_test, prediction))
    precision_lst.append(precision_score(y_test, prediction))
    recall_lst.append(recall_score(y_test, prediction))
    f1_lst.append(f1_score(y_test, prediction))
    auc_lst.append(roc_auc_score(y_test, prediction))

print("Accuracy:", np.mean(accuracy_lst))
print("Precision:", np.mean(precision_lst))
print("Recall:", np.mean(recall_lst))
print("F1 Score:",  np.mean(f1_lst))
print("AUC Score:",  np.mean(auc_lst))

and this is the code of the outlier remover:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class OutlierRemover(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        self.X=X
        self.y=y
        return self

    def transform(self, X, y=None):
        new_X = X.copy()
        for col in self.columns:
            q25, q75 = np.percentile(new_X[col], 25), np.percentile(new_X[col], 75)
            iqr = q75 - q25
            cut_off = iqr * 1.5
            lower, upper = q25 - cut_off, q75 + cut_off
            indices_to_drop = new_X[(new_X[col] > upper) | (new_X[col] < lower)].index
            new_X = new_X.drop(indices_to_drop)
        if y is not None:
            new_y = y.drop(indices_to_drop)
            return new_X, new_y
        else:
            return new_X

The error is "ValueError: Found input variables with inconsistent numbers of samples: [310231, 363920]", because X is reduced but y is not. I have tried different things but nothing works for me.


Solution

  • You need to use (something like) imblearn's pipeline if you want to modify y. It seems like you must be using it already, since you use oversampling.

    So then your outlier removal just needs to comply with imblearn's sampler API: you should define a fit_resample method instead of transform, returning both X and y.
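    As a minimal sketch of that approach: the IQR-based removal can be written as a plain function returning both X and y, and wrapped with imblearn's FunctionSampler so it slots into an imblearn pipeline. The names iqr_outlier_remover and factor are my own, not from your code.

    ```python
    import numpy as np
    import pandas as pd

    def iqr_outlier_remover(X, y, columns=None, factor=1.5):
        """Drop rows of X (and the matching rows of y) that fall outside
        [q25 - factor*IQR, q75 + factor*IQR] in any of the given columns."""
        X, y = pd.DataFrame(X), pd.Series(y)
        mask = pd.Series(True, index=X.index)
        for col in (columns if columns is not None else X.columns):
            q25, q75 = np.percentile(X[col], [25, 75])
            cut_off = (q75 - q25) * factor
            mask &= X[col].between(q25 - cut_off, q75 + cut_off)
        return X[mask], y[mask]

    # To plug it into the pipeline (FunctionSampler wraps any
    # func(X, y) -> (X_res, y_res); validate=False keeps DataFrames intact):
    # from imblearn import FunctionSampler
    # from imblearn.pipeline import make_pipeline
    # pipeline = make_pipeline(
    #     RandomOverSampler(random_state=42),
    #     FunctionSampler(func=iqr_outlier_remover,
    #                     kw_args={"columns": ['V14', 'V12', 'V10', 'V4', 'V11', 'V2']},
    #                     validate=False),
    #     rand_log_reg,
    # )
    ```

    Because samplers are only applied during fit, the rows are dropped when training but left alone at predict time, which is exactly the behavior you want for outlier removal inside cross-validation.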