I am using KNNImputer to impute np.nan values in several pandas DataFrames. I checked that every column in each DataFrame has a numeric dtype. However, KNNImputer drops some columns in some of the DataFrames:
>>> input_df.shape
(816, 216)
>>> input_df.dtypes.value_counts()
float64    216
dtype: int64
>>> output_df.shape
(816, 27)
I used the following KNNImputer configuration:
imputer = KNNImputer(n_neighbors=1,
                     weights="uniform",
                     add_indicator=False)
output_df = imputer.fit_transform(input_df)
I would like to know why this is happening, since every one of the DataFrames contains np.nan values. By the way, the parameter n_neighbors=1 should not have any impact on the outcome, since I am only replacing each missing value with the value from its closest neighbor.
Your data probably contains columns in which every value is np.nan. KNNImputer has no observed values to impute from in such a column, so by default it drops the column from the output. A small example reproduces this:
>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.impute import KNNImputer
>>>
>>> imputer = KNNImputer(n_neighbors=1,
...                      weights="uniform",
...                      add_indicator=False)
>>>
>>> # Column 3 is np.nan in every row
>>> df = pd.DataFrame([[1.69, 2.69, np.nan, np.nan], [3.69, 4.69, 3.69, np.nan], [np.nan, 6.69, 5.69, np.nan], [8.69, 8.69, 7.69, np.nan]])
>>> print(df)
      0     1     2   3
0  1.69  2.69   NaN NaN
1  3.69  4.69  3.69 NaN
2   NaN  6.69  5.69 NaN
3  8.69  8.69  7.69 NaN
>>> print(df.shape)
(4, 4)
>>> print(df.dtypes.value_counts())
float64    4
Name: count, dtype: int64
>>>
>>> output_df = imputer.fit_transform(df)
>>> print(output_df.shape)
(4, 3)
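To confirm that this is what is happening in your own data, you can list the columns that are entirely np.nan before imputing. The following is a minimal sketch on toy data; the column names "a" to "d" and the variable all_nan_columns are made up for illustration, and you would substitute your own input_df.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical column names); column "d" is np.nan in every row.
df = pd.DataFrame({
    "a": [1.69, 3.69, np.nan, 8.69],
    "b": [2.69, 4.69, 6.69, 8.69],
    "c": [np.nan, 3.69, 5.69, 7.69],
    "d": [np.nan] * 4,
})

# isna().all() is True only for columns in which every row is missing.
all_nan_columns = df.columns[df.isna().all()].tolist()

print(all_nan_columns)  # ['d']
print(len(all_nan_columns), "of", df.shape[1], "columns are entirely NaN")
```

If the number of columns reported here matches the number of columns missing from the imputer output, the all-NaN columns are the ones being dropped.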
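You can avoid this by setting the keep_empty_features parameter to True instead of its default, False. Columns that are entirely np.nan are then kept in the output and filled with 0 (the parameter is available in scikit-learn 1.2 and later):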
>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.impute import KNNImputer
>>>
>>> imputer = KNNImputer(n_neighbors=1,
...                      weights="uniform",
...                      keep_empty_features=True,
...                      add_indicator=False)
>>>
>>> # Column 3 is np.nan in every row
>>> df = pd.DataFrame([[1.69, 2.69, np.nan, np.nan], [3.69, 4.69, 3.69, np.nan], [np.nan, 6.69, 5.69, np.nan], [8.69, 8.69, 7.69, np.nan]])
>>> print(df)
      0     1     2   3
0  1.69  2.69   NaN NaN
1  3.69  4.69  3.69 NaN
2   NaN  6.69  5.69 NaN
3  8.69  8.69  7.69 NaN
>>> print(df.shape)
(4, 4)
>>> print(df.dtypes.value_counts())
float64    4
Name: count, dtype: int64
>>>
>>> output_df = imputer.fit_transform(df)
>>> print(output_df.shape)
(4, 4)