I am using DBSCAN on my training dataset to find outliers and remove them before training a model. The training set has 7697 rows and 8 columns. Here is my code:
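from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Standardize the features, then fit DBSCAN
X = StandardScaler().fit_transform(X_train[all_features])
model = DBSCAN(eps=0.3, min_samples=10).fit(X)
print(model)

# Drop the rows that DBSCAN labeled as noise (-1)
X_train_1 = X_train.drop(X_train[model.labels_ == -1].index).copy()
X_train_1.reset_index(drop=True, inplace=True)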
Q-1: Out of these 7, some are discrete and some are continuous. Is it OK to scale both discrete and continuous columns, or just the continuous ones?
Q-2: Do I need to map the clusters to the test data, since they were learned from the training data?
DBSCAN will flag those outliers for you. That's what it was built for: points that do not belong to any dense cluster are labeled -1 (noise). See the example below and post back if you have additional questions.
import seaborn as sns
import pandas as pd
titanic = sns.load_dataset('titanic')
titanic = titanic.copy()
titanic = titanic.dropna()
titanic['age'].plot.hist(
    bins = 50,
    title = "Histogram of the age variable"
)
from scipy.stats import zscore
titanic["age_zscore"] = zscore(titanic["age"])
titanic["is_outlier"] = titanic["age_zscore"].apply(
    lambda x: x <= -2.5 or x >= 2.5
)
titanic[titanic["is_outlier"]]
ageAndFare = titanic[["age", "fare"]]
ageAndFare.plot.scatter(x = "age", y = "fare")
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
ageAndFare = scaler.fit_transform(ageAndFare)
ageAndFare = pd.DataFrame(ageAndFare, columns = ["age", "fare"])
ageAndFare.plot.scatter(x = "age", y = "fare")
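On Q-1: the example here scales two continuous columns. If you decide to leave the discrete columns unscaled, one option is scikit-learn's ColumnTransformer with remainder="passthrough". The sketch below uses a small made-up frame, and the column names and the continuous_cols list are assumptions for illustration only. Whether the discrete columns should be scaled too depends on how you want them to weigh in the distance computation.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame: two continuous columns and one discrete column.
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "pclass": [3, 1, 3, 1],
})
continuous_cols = ["age", "fare"]  # assumed split; adapt to your own features

# Standardize only the continuous columns; pass the rest through unchanged.
scale_continuous = ColumnTransformer(
    [("scale", StandardScaler(), continuous_cols)],
    remainder="passthrough",
)

# Output is a NumPy array: scaled columns first, passthrough columns after.
X = scale_continuous.fit_transform(df)
print(X)
```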
from sklearn.cluster import DBSCAN
outlier_detection = DBSCAN(
    eps = 0.5,
    metric = "euclidean",
    min_samples = 3,
    n_jobs = -1
)
clusters = outlier_detection.fit_predict(ageAndFare)
clusters
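The array returned by fit_predict holds one label per row, and DBSCAN uses -1 for rows it treats as noise. A minimal sketch of how to read those labels, using made-up toy points (a tight blob plus two distant points) rather than the Titanic data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (an assumption for illustration): a tight blob plus two far-away points.
rng = np.random.default_rng(0)
blob = rng.normal(loc=0.5, scale=0.02, size=(50, 2))
far = np.array([[5.0, 5.0], [-5.0, 5.0]])
points = np.vstack([blob, far])

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(points)

# Rows labeled -1 are noise, i.e. the outlier candidates.
is_noise = labels == -1
# Number of clusters found, not counting the noise label.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print("noise points:", int(is_noise.sum()))
print("clusters:", n_clusters)
```

Note that eps is measured in the units of the scaled data, so a value that works on one scaling may flag nothing, or nearly everything, on another.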
import matplotlib.pyplot as plt
# cm.get_cmap was deprecated in Matplotlib 3.7 and later removed; plt.get_cmap still works
cmap = plt.get_cmap('Accent')
ageAndFare.plot.scatter(
    x = "age",
    y = "fare",
    c = clusters,
    cmap = cmap,
    colorbar = False
)
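To tie this back to the original question, here is a minimal sketch of dropping the noise rows from a training set with a boolean mask. X_train, y_train and all_features below are made-up stand-ins, not your data. On Q-2: scikit-learn's DBSCAN has fit and fit_predict but no predict method, so labels only exist for the rows it was fitted on. If DBSCAN is used only to clean the training set, there is nothing to map onto the test data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for X_train, y_train and all_features.
rng = np.random.default_rng(0)
all_features = ["f1", "f2"]
X_train = pd.DataFrame(rng.normal(scale=0.1, size=(200, 2)), columns=all_features)
X_train.loc[len(X_train)] = [8.0, 8.0]  # one obvious outlier
y_train = pd.Series(np.arange(len(X_train)))

# Scale, fit DBSCAN, and keep only the rows that were not labeled as noise.
X = StandardScaler().fit_transform(X_train[all_features])
model = DBSCAN(eps=0.3, min_samples=10).fit(X)
keep = model.labels_ != -1

# Apply the same mask to features and target so they stay aligned.
X_train_1 = X_train.loc[keep].reset_index(drop=True)
y_train_1 = y_train.loc[keep].reset_index(drop=True)

print(len(X_train), "->", len(X_train_1))
```

Using one mask for both X_train and y_train avoids the index bookkeeping of drop plus reset_index on each object separately.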