
Best way to use KNNimputer?


I want to impute missing values with KNN, and I use this method to select the best k:

from sklearn.neighbors import KNeighborsClassifier

# neighbors holds the candidate k values; the arrays collect the scores
for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_accurate[i] = knn.score(X_train, y_train)
    test_accurate[i] = knn.score(X_test, y_test)

Then I apply KNNImputer with n_neighbors set to the most accurate k. Does KNNImputer need this step, or does it check that by itself? If this step is useful, is there a shorter version that doesn't require a train/test split?
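For context, this is how I currently call the imputer once k is chosen (the toy array here is just a stand-in for my real data):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data with missing values (stand-in for the real dataset)
X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [np.nan, 8.0]])

# n_neighbors is fixed by hand; KNNImputer does not tune it internally
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```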


Solution

  • There is a way to check the best k that does not require a train/test split.

    The idea is to compare the density of one variable (I would pick the one needing the most imputations) after imputing with different values of k. The k whose imputed distribution stays closest to the original distribution is the best choice.

    import seaborn as sb
    import matplotlib.pyplot as plt
    from sklearn.impute import KNNImputer

    n_neighbors = [1, 2, 3, 5, 7, 9, 20, 30]

    fig, ax = plt.subplots(figsize=(16, 8))
    # Plot the original distribution of the chosen variable
    sb.kdeplot(df.variableselected, label="Original Distribution")

    # Overlay the imputed distribution for each candidate k
    density = df.copy()
    for k in n_neighbors:
        knn_imp = KNNImputer(n_neighbors=k)
        density.loc[:, :] = knn_imp.fit_transform(df)
        sb.kdeplot(density.variableselected, label=f"Imputed Dist with k={k}")

    plt.legend()
    plt.show()
    

    In the example shown below, every k is about equally accurate, but this can vary depending on the data.

    [KDE plot: original distribution vs imputed distributions for each k]
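If you want to pick k programmatically instead of eyeballing the plot, one option (a sketch, not part of the original answer) is to score each k by how close the imputed distribution is to the original, for example with the two-sample Kolmogorov–Smirnov statistic from scipy. The synthetic DataFrame and the column name "a" below are stand-ins:

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp
from sklearn.impute import KNNImputer

# Synthetic data: punch holes into column "a"
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
df.loc[rng.choice(200, 40, replace=False), "a"] = np.nan

original = df["a"].dropna()
best_k, best_stat = None, np.inf
for k in [1, 2, 3, 5, 7, 9]:
    imputed = KNNImputer(n_neighbors=k).fit_transform(df)
    # KS statistic: smaller means the imputed distribution is closer
    # to the observed (non-missing) distribution
    stat = ks_2samp(original, imputed[:, 0]).statistic
    if stat < best_stat:
        best_k, best_stat = k, stat

print(best_k, best_stat)
```

This automates the same criterion as the density plots: the k that changes the distribution the least wins.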