I want to impute missing values with KNN, and I use this method to select the best K:
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

train_accurate = np.empty(len(neighbors))
test_accurate = np.empty(len(neighbors))
for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_accurate[i] = knn.score(X_train, y_train)
    test_accurate[i] = knn.score(X_test, y_test)
```
Then I apply KNNImputer with n_neighbors set to the most accurate K. Does KNNImputer need this step, or does it select K by itself? And if this step is useful, is there a shorter version that avoids the train/test split?
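For reference, KNNImputer does not tune n_neighbors on its own: you pass a fixed value and it averages the K nearest rows (by distance over the non-missing features) to fill each gap. A minimal self-contained sketch with made-up toy data:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with one missing value; column 0 correlates with column 1.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [np.nan, 8.0],
])

# n_neighbors is fixed by the caller; KNNImputer never searches over it.
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)

# The NaN is replaced by the mean of column 0 in the 2 nearest rows
# (rows [3.0, 6.0] and [2.0, 4.0]), i.e. (3.0 + 2.0) / 2
print(X_imputed[3, 0])  # → 2.5
```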
There is actually a way to check for the best K that does not require a train/test split.
The idea is to look at the density of one variable under different values of K (I pick the variable that needs the most imputations) and compare each imputed distribution with the original one: the K whose curve stays closest to the original distribution is the best choice.
```python
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.impute import KNNImputer

n_neighbors = [1, 2, 3, 5, 7, 9, 20, 30]
fig, ax = plt.subplots(figsize=(16, 8))

# Plot the original distribution (NaNs are simply ignored by kdeplot)
sb.kdeplot(df.variableselected, label="Original Distribution", ax=ax)

density = df.copy()  # same shape as df; holds each imputed version
for k in n_neighbors:
    knn_imp = KNNImputer(n_neighbors=k)
    density.loc[:, :] = knn_imp.fit_transform(df)
    sb.kdeplot(density.variableselected, label=f"Imputed Dist with k={k}", ax=ax)

ax.legend()
plt.show()
```
In the example shown below, every K comes out about equally accurate, but this can vary depending on the data.
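If you would rather not judge the KDE curves by eye, the same idea can be scored numerically, for instance with a two-sample Kolmogorov–Smirnov statistic between the originally observed values and the imputed column; the K with the smallest statistic wins. A sketch on synthetic data (the column names and candidate K list here are made up):

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Synthetic data: two correlated columns, ~20% of column "a" set to NaN.
df = pd.DataFrame({"a": rng.normal(size=500)})
df["b"] = df["a"] * 0.8 + rng.normal(scale=0.3, size=500)
mask = rng.random(500) < 0.2
df.loc[mask, "a"] = np.nan

original = df["a"].dropna()

scores = {}
for k in [1, 2, 3, 5, 7, 9, 20, 30]:
    imputed = pd.DataFrame(
        KNNImputer(n_neighbors=k).fit_transform(df), columns=df.columns
    )
    # KS statistic: 0 means identical distributions, 1 means disjoint.
    scores[k] = ks_2samp(original, imputed["a"]).statistic

best_k = min(scores, key=scores.get)
print(best_k, scores[best_k])
```

This is the same comparison the plots make, just condensed into one number per K, so it scales to checking many variables at once.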