Right now I'm trying to create a pipeline whose first step is Random Oversampling and whose second step is a custom outlier remover, but I'm having problems executing that pipeline.
This is my code for the pipeline and the whole process:
accuracy_lst = []
precision_lst = []
recall_lst = []
f1_lst = []
auc_lst = []
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
log_reg_params = {"penalty": ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
rand_log_reg = RandomizedSearchCV(LogisticRegression(max_iter = 200), log_reg_params, n_iter=4)
for train, test in kf.split(Org_X_train, Org_y_train):
    X_train, X_test = Org_X_train.iloc[train], Org_X_train.iloc[test]
    y_train, y_test = Org_y_train.iloc[train], Org_y_train.iloc[test]
    print(X_train.index)
    pipeline = make_pipeline(RandomOverSampler(random_state=42), OutlierRemover(columns=['V14', 'V12', 'V10', 'V4', 'V11', 'V2']), rand_log_reg)
    print(X_train.index)
    model = pipeline.fit(X_train, y_train)
    best_est = rand_log_reg.best_estimator_
    prediction = best_est.predict(X_test)
    accuracy_lst.append(accuracy_score(y_test, prediction))
    precision_lst.append(precision_score(y_test, prediction))
    recall_lst.append(recall_score(y_test, prediction))
    f1_lst.append(f1_score(y_test, prediction))
    auc_lst.append(roc_auc_score(y_test, prediction))
print("Accuracy:", np.mean(accuracy_lst))
print("Precision:", np.mean(precision_lst))
print("Recall:", np.mean(recall_lst))
print("F1 Score:", np.mean(f1_lst))
print("AUC Score:", np.mean(auc_lst))
and this is the code of the outlier remover:
class OutlierRemover(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        self.X = X
        self.y = y
        return self

    def transform(self, X, y=None):
        new_X = X.copy()
        for col in self.columns:
            q25, q75 = np.percentile(new_X[col], 25), np.percentile(new_X[col], 75)
            iqr = q75 - q25
            cut_off = iqr * 1.5
            lower, upper = q25 - cut_off, q75 + cut_off
            indices_to_drop = new_X[(new_X[col] > upper) | (new_X[col] < lower)].index
            new_X = new_X.drop(indices_to_drop)
        if y is not None:
            new_y = y.drop(indices_to_drop)
            return new_X, new_y
        else:
            return new_X
The error is "ValueError: Found input variables with inconsistent numbers of samples: [310231, 363920]" because X is reduced but y is not. I have tried different things but nothing works for me.
You need to use (something like) imblearn's pipeline if you want to modify y. It seems like you must be already, since you use oversampling. So then your outlier removal just needs to comply with imblearn's sampler standards: you should define a fit_resample method instead of transform, returning both X and y.