python machine-learning scikit-learn data-science imputation

Understanding sklearn's KNNImputer


I was going through its documentation and it says

Each sample’s missing values are imputed using the mean value from n_neighbors nearest neighbors found in the training set. Two samples are close if the features that neither are missing are close.

Now, playing around with a toy dataset, i.e.

>>> import numpy as np
>>> X = np.array([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]])
>>> X
array([[ 1.,  2., nan],
       [ 3.,  4.,  3.],
       [nan,  6.,  5.],
       [ 8.,  8.,  7.]])

And we make a KNNImputer as follows:

from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2)

The question is, how does it fill the NaNs when two of the columns contain NaNs? For example, to fill the NaN in the 3rd column of the 1st row, how does it decide which rows are the closest, given that one of the other rows has a NaN in the first column as well? When I do imputer.fit_transform(X) it gives me

array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])

which means that, to fill the NaN in the first row, the nearest neighbors were the second and the third rows. How did it calculate the Euclidean distance between the first and the third row?


Solution

  • How does it fill the NaNs using rows that also have NaNs?

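    The docs for KNNImputer don't spell this out, but the source code does. As far as I can tell, for each column being imputed, every row that has a value in that column is a potential donor, even if that row has missing values in other columns. Distances are computed with the default metric, nan_euclidean, which skips any coordinate that is missing in either row and scales the result up by the ratio of total coordinates to coordinates present in both rows. The n_neighbors closest donors are then averaged. If a distance is itself NaN (the two rows share no observed feature), the corresponding entry in the weight matrix is set to 0; that matrix is derived from the distances according to the weights parameter, see _get_weights.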

    The relevant code is in _calc_impute. After selecting the nearest donors from the distance matrix of all potential donors and building the weight matrix mentioned above, it zeroes out any NaN weights before averaging:

    # fill nans with zeros
    if weight_matrix is not None:
        weight_matrix[np.isnan(weight_matrix)] = 0.0
    

    Every potential donor is considered, as long as there is at least one non-NaN distance between the receiver and a potential donor, as the docstring puts it:

    dist_pot_donors : ndarray of shape (n_receivers, n_potential_donors)
        Distance matrix between the receivers and potential donors from
        training set. There must be at least one non-nan distance between
        a receiver and a potential donor.
    

    We can check this with a toy example. In the following matrix, when imputing the missing value in [nan, 7., 4., 5.], the last row (which itself contains two NaNs) is chosen (note that I've set n_neighbors=1). This is because the distance to the last row is 0: coordinates where either row has a NaN are skipped, and the only feature both rows have observed (the last column) is 5 in both. Rows 2 and 3 each differ from the first row by 1 in a single feature, so they end up farther away than the last row, which is treated as an exact match:

    import numpy as np
    X = np.array([[np.nan,7,4,5],[2,8,4,5],[3,7,4,6],[1,np.nan,np.nan,5]])
    
    print(X)
    array([[nan,  7.,  4.,  5.],
           [ 2.,  8.,  4.,  5.],
           [ 3.,  7.,  4.,  6.],
           [ 1., nan, nan,  5.]])
    
    from sklearn.impute import KNNImputer
    imputer = KNNImputer(n_neighbors=1)
    
    imputer.fit_transform(X)
    array([[1., 7., 4., 5.],
           [2., 8., 4., 5.],
           [3., 7., 4., 6.],
           [1., 7., 4., 5.]])
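
To see the distances behind both results, here is a minimal sketch that calls sklearn.metrics.pairwise.nan_euclidean_distances directly and then repeats the imputation of one cell by hand. It assumes the default settings (metric="nan_euclidean", weights="uniform"); the variable names (X_toy, X_q, donors, nearest, manual) are mine and are not taken from the scikit-learn source, and the hand-rolled part is only an approximation of what _calc_impute does, not a copy of it.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

nan = np.nan

# Toy matrix from the answer above.
X_toy = np.array([[nan, 7, 4, 5],
                  [2, 8, 4, 5],
                  [3, 7, 4, 6],
                  [1, nan, nan, 5]])

# Pairwise distances that skip coordinates missing in either row and
# rescale by (total coordinates / coordinates present in both rows).
D_toy = nan_euclidean_distances(X_toy)
print(np.round(D_toy[0], 3))
# Row 0 and row 3 share only the last column (5 vs 5), so their distance is 0.

# Matrix from the question.
X_q = np.array([[1, 2, nan],
                [3, 4, 3],
                [nan, 6, 5],
                [8, 8, 7]])
D_q = nan_euclidean_distances(X_q)

# Hand-rolled imputation of X_q[0, 2]
# (assumption: uniform weights, n_neighbors=2).
col = 2
donors = np.flatnonzero(~np.isnan(X_q[:, col]))   # rows with a value in that column
nearest = donors[np.argsort(D_q[0, donors])[:2]]  # the two closest donors
manual = X_q[nearest, col].mean()                 # plain mean of their values

# Compare against the imputer itself.
imputed = KNNImputer(n_neighbors=2).fit_transform(X_q)
print(nearest, manual, imputed[0, col])
```

For the question's matrix, the first and third rows only share the second column (2 vs 6), so the squared difference of 16 is scaled by 3/1 before taking the square root. That is still closer than the fourth row, which is why the second and third rows end up as the two donors.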