I want to impute missing values with KNN, and I use this method to select the best K:
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

train_accurate = np.empty(len(neighbors))
test_accurate = np.empty(len(neighbors))
for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_accurate[i] = knn.score(X_train, y_train)
    test_accurate[i] = knn.score(X_test, y_test)
```
Then I apply KNNImputer with n_neighbors set to the most accurate K. Does KNNImputer need this step, or does it select K by itself? And if this step is useful, is there a shorter version that avoids the train/test split?
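For reference, KNNImputer does not tune n_neighbors on its own: you pass a fixed value and it averages the K nearest rows (by distance over the non-missing features) to fill each gap. A minimal self-contained sketch with made-up toy data:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with one missing value; column 0 correlates with column 1.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [np.nan, 8.0],
])

# n_neighbors is fixed by the caller; KNNImputer never searches over it.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)

# The NaN is replaced by the mean of column 0 in the 2 nearest rows
# (rows [3.0, 6.0] and [2.0, 4.0]), i.e. (3.0 + 2.0) / 2
print(X_imputed[3, 0])  # → 2.5
```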
There is actually a way to check for the best K that does not require a train/test split.
The idea is to look at the density of one variable under different values of K (I pick the variable that needs the most imputations) and compare each imputed distribution with the original one: the K whose curve stays closest to the original distribution is the best choice.
```python
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.impute import KNNImputer

n_neighbors = [1, 2, 3, 5, 7, 9, 20, 30]
fig, ax = plt.subplots(figsize=(16, 8))

# Plot the original distribution (NaNs are simply ignored by kdeplot)
sb.kdeplot(df.variableselected, label="Original Distribution", ax=ax)

density = df.copy()  # same shape as df; holds each imputed version
for k in n_neighbors:
    knn_imp = KNNImputer(n_neighbors=k)
    density.loc[:, :] = knn_imp.fit_transform(df)
    sb.kdeplot(density.variableselected, label=f"Imputed Dist with k={k}", ax=ax)

ax.legend()
plt.show()
```
In the example shown below, every K comes out about equally accurate, but this can vary depending on the data.
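If you would rather not judge the KDE curves by eye, the same idea can be scored numerically, for instance with a two-sample Kolmogorov–Smirnov statistic between the originally observed values and the imputed column; the K with the smallest statistic wins. A sketch on synthetic data (the column names and candidate K list here are made up):

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Synthetic data: two correlated columns, ~20% of column "a" set to NaN.
df = pd.DataFrame({"a": rng.normal(size=500)})
df["b"] = df["a"] * 0.8 + rng.normal(scale=0.3, size=500)
mask = rng.random(500) < 0.2
df.loc[mask, "a"] = np.nan

original = df["a"].dropna()

scores = {}
for k in [1, 2, 3, 5, 7, 9, 20, 30]:
    imputed = pd.DataFrame(
        KNNImputer(n_neighbors=k).fit_transform(df), columns=df.columns
    )
    # KS statistic: 0 means identical distributions, 1 means disjoint.
    scores[k] = ks_2samp(original, imputed["a"]).statistic

best_k = min(scores, key=scores.get)
print(best_k, scores[best_k])
```

This is the same comparison the plots make, just condensed into one number per K, so it scales to checking many variables at once.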