dataframe machine-learning scikit-learn knn imputation

Can sklearn's KNNImputer work with specific rows within a dataframe?


I have a pandas dataframe with some NaN values, and I am trying to fill them with sklearn's KNNImputer. I want the imputer to pick its neighbors only from rows that share the same "patient_id". The missing values are medical analysis results.

I tried to solve this by first creating a list of unique "patient_id" values:

patient_list=data['patient_id'].unique()

Then I iterated through the list, masking on 'patient_id' and trying to merge the sub-dataframes back together:

from sklearn.impute import KNNImputer
knn = KNNImputer(missing_values=np.nan)

data_imputed = pd.DataFrame()

for patient_id in patient_list:
    X = knn.fit_transform(data[data['patient_id']==patient_id])
    X_ = pd.DataFrame(X, columns = data.columns)
    data_imputed.merge(X_, on=['patient_id','visit_month','visit_id'], how='left', copy=False)

But it gives me a ValueError:

ValueError: Shape of passed values is (4, 1187), indices imply (4, 1198)

My original dataframe has 1198 columns, so how did 11 columns go missing? Thank you for helping!


Solution

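  • import numpy as np
    import pandas as pd
    from sklearn.impute import KNNImputer
    
    knn = KNNImputer(missing_values=np.nan)
    
    data_imputed = []
    
    for patient_id in patient_list:
        group = data[data['patient_id'] == patient_id].copy()
        # KNNImputer drops columns that are all NaN for this patient, which
        # caused the shape mismatch, so impute only the columns with data.
        cols = group.columns[group.notna().any()]
        group[cols] = knn.fit_transform(group[cols])
        # list.append() takes only the item; merge keywords do not apply here.
        data_imputed.append(group)
    
    data_imputed = pd.concat(data_imputed)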
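
As for why 11 columns went missing: by default, KNNImputer drops any column that contains no observed values at fit time, so a patient with no results at all for 11 of the analyses produces an array that is 11 columns narrower than data.columns. The sketch below reproduces this on a small made-up dataframe (the column names test_a and test_b, the values, and the helper impute_group are hypothetical, not from the original data) and then imputes per patient while leaving the all-NaN columns as NaN. It assumes every column is numeric.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data (hypothetical): patient 1 has no values at all for test_b.
data = pd.DataFrame({
    "patient_id":  [1, 1, 1, 2, 2, 2],
    "visit_month": [0, 6, 12, 0, 6, 12],
    "test_a":      [1.0, np.nan, 3.0, 10.0, 11.0, np.nan],
    "test_b":      [np.nan, np.nan, np.nan, 5.0, np.nan, 7.0],
})

knn = KNNImputer(n_neighbors=2)

# Fitting on patient 1 alone: test_b is entirely NaN, so it is dropped.
patient_1 = data[data["patient_id"] == 1]
print(patient_1.shape)                     # 3 rows, 4 columns
print(knn.fit_transform(patient_1).shape)  # 3 rows, 3 columns


def impute_group(group):
    """Impute one patient's rows, leaving all-NaN columns untouched."""
    group = group.copy()
    cols = group.columns[group.notna().any()]  # columns with some data
    group[cols] = knn.fit_transform(group[cols])
    return group


# Impute each patient separately, then stack the results back together.
parts = [impute_group(g) for _, g in data.groupby("patient_id")]
data_imputed = pd.concat(parts)
print(data_imputed)
```

Columns that are empty for a patient stay NaN in the result, since there is nothing within that patient's rows to impute them from. Whether to fill those from other patients or leave them missing is a separate modelling decision.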