
kNN imputation algorithm in the VIM package returns wrong results


Based on my understanding, the kNN algorithm in the R VIM package takes the k points nearest to a missing point and then aggregates them using a function such as mean or median. If that is the case, why does the code below return the wrong result?

library(VIM)

ts_data <- c(1, 2, 3, 4, 5, NA, 7, 8, 9)
imputed_ts <- kNN(as.data.frame(as.table(ts_data)), k = 2, numFun = mean, imp_var = FALSE)
print(imputed_ts)

Output:
Var1 Freq
1    A  1.0
2    B  2.0
3    C  3.0
4    D  4.0
5    E  5.0
6    F  1.5
7    G  7.0
8    H  8.0
9    I  9.0 

Why is the missing point (F) imputed as 1.5 instead of 6, the mean of its two nearest values 5 and 7?


Solution

  • The 1.5 you are getting comes from not specifying variable = 2 / dist_var = "Freq", which causes kNN to treat Var1 as a categorical distance variable and then average the Freq values 1 and 2 of rows A and B (the lowest letters).
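
    For intuition: every letter in Var1 is a unique level, so the Gower distance from row F to every other row is 1, all neighbours tie, and the first k = 2 rows win. A quick sketch to check this, making the implicit Var1-only distance explicit (I'd expect it to reproduce the 1.5):

    library(VIM)
    ts_data <- as.data.frame(as.table(c(1, 2, 3, 4, 5, NA, 7, 8, 9)))
    # Force the neighbour search onto Var1 alone: all Gower distances tie at 1,
    # the first k = 2 rows (A and B) are picked, and mean(1, 2) gives 1.5
    kNN(ts_data, k = 2, numFun = mean, dist_var = "Var1", imp_var = FALSE)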

    The documentation says the kNN distance is based on an extension of Gower distance (not Euclidean), with weights applied based on random-forest variable importance measures from the ranger package unless otherwise specified; those automatic weights apply to all variables, including Var1.

    library(VIM)
    ts_data <- data.frame(
      Var1 = c("A", "B", "C", "D", "E", "F", "G", "H", "I"),
      Freq = c(1, 2, 7, 4, 5, NA, 7, 8, 9)
    )
    # Impute only column 2 and compute distances on Freq alone,
    # so the categorical Var1 no longer drives the neighbour search
    imputed_ts <- kNN(ts_data, k = 2, numFun = mean, imp_var = FALSE,
                      variable = 2, dist_var = "Freq", trace = TRUE)
    print(imputed_ts)
    

    If we ignore Var1 and change a few of the numbers to higher values, you'll notice that it is not behaving in any linear way: at k = 1 it picks the highest value, and at k = 2 it seems to pick the average of 8 and 9 because they are the two highest values. If you change position 3 to 70, you'll notice that it again averages the two highest. A sketch of that experiment follows below.
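
    A minimal sketch of that single-variable experiment (the comments reflect the behaviour reported above; exact output may depend on your VIM version):

    library(VIM)
    ts_num <- data.frame(Freq = c(1, 2, 3, 4, 5, NA, 7, 8, 9))
    # With only a numeric column containing the NA, the distance from row 6
    # is undefined, and kNN ends up treating the missing row as if it were
    # infinitely large:
    kNN(ts_num, k = 1, numFun = mean, imp_var = FALSE)$Freq[6]  # reportedly 9, the highest value
    kNN(ts_num, k = 2, numFun = mean, imp_var = FALSE)$Freq[6]  # reportedly 8.5, the mean of 8 and 9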

    In a single-variable world, it appears to treat NA as infinity, so the two values "closest" to it are the highest values.

    If you run gower.dist(ts_data) from library(StatMatch) on the letters, position 6 has dist = 1 in all rows/columns other than itself, and with a single continuous variable row 6 is all NaNs, as the sketch below shows.
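
    A sketch of those two gower.dist calls, splitting the example's two columns into separate data frames:

    library(StatMatch)
    # Categorical column alone: each letter is a unique level, so every
    # off-diagonal Gower distance is 1 and row 6 ties with every other row
    gower.dist(data.frame(Var1 = factor(LETTERS[1:9])))[6, ]
    # Single numeric column with an NA: every distance from row 6 is undefined
    gower.dist(data.frame(Freq = c(1, 2, 3, 4, 5, NA, 7, 8, 9)))[6, ]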

    I think this kNN is optimized for large, complex datasets, so it may work better in that context than on a simple problem like this.