I have a pandas DataFrame with some NaN values, and I am trying to fill them with the KNN imputer. I want the imputer to pick its neighbors only among rows that share the same "patient_id". The missing values are medical analysis results.
I tried to solve this by first building a list of the unique "patient_id" values, using:
patient_list=data['patient_id'].unique()
Then I iterated through the list, masking the dataframe on 'patient_id', imputing each sub-dataframe, and merging the results back together, with:
from sklearn.impute import KNNImputer
knn = KNNImputer(missing_values=np.nan)
data_imputed = pd.DataFrame()
for patient_id in patient_list:
    X = knn.fit_transform(data[data['patient_id']==patient_id])
    X_ = pd.DataFrame(X, columns = data.columns)
    data_imputed.merge(X_, on=['patient_id','visit_month','visit_id'], how='left', copy=False)
but it is giving me a ValueError:
ValueError: Shape of passed values is (4, 1187), indices imply (4, 1198)
My original dataframe has 1198 columns, so how did 11 columns go missing? Thank you for helping!
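A likely explanation for the missing columns: by default, KNNImputer drops every column that is entirely NaN in the data it is fitted on. When you fit on a single patient's rows, any analysis that patient never had is all-NaN for that subset, so it is removed from the output, and data.columns no longer matches. Here is a minimal sketch of that behaviour on a made-up toy frame (the column names "a", "b", "c" are only for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame (hypothetical data): column "b" is entirely NaN.
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan, np.nan, np.nan, np.nan],
    "c": [1.0, np.nan, 3.0, 4.0],
})

knn = KNNImputer(missing_values=np.nan)
X = knn.fit_transform(toy)

# Columns that have at least one observed value in this subset.
kept_columns = toy.columns[toy.notna().any()]

print(toy.shape)           # shape going in
print(X.shape)             # one column fewer: "b" was dropped
print(list(kept_columns))  # the labels that still match X
```

If that is what is happening in your data, 11 of your 1198 columns are completely empty for the patient that triggers the error. As far as I know, scikit-learn 1.2 and later also accept keep_empty_features=True, which keeps such columns but fills them with 0, so check whether a 0 is acceptable for your measurements before using it.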
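import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

knn = KNNImputer(missing_values=np.nan)
data_imputed = []  # collect one imputed frame per patient
for patient_id in patient_list:
    subset = data[data['patient_id'] == patient_id]
    # KNNImputer drops columns that are entirely NaN in the data it is fitted on,
    # so label the output with the surviving columns rather than data.columns.
    kept = subset.columns[subset.notna().any()]
    X = knn.fit_transform(subset)
    data_imputed.append(pd.DataFrame(X, columns=kept, index=subset.index))
# Stack the per-patient frames; columns dropped for a patient come back as NaN.
data_imputed = pd.concat(data_imputed)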
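For reference, the same idea can be wrapped in a small function and tried on toy data. This is only a sketch under a few assumptions: every column is numeric, "patient_id" itself has no NaN, and the helper name impute_per_patient and the protein_a / protein_b columns are made up for the example. Columns that are entirely empty for a patient are left as NaN, because there is nothing within that patient to impute them from.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def impute_per_patient(data, group_col="patient_id", n_neighbors=5):
    """Impute NaNs with KNNImputer, fitting separately on each patient's rows."""
    pieces = []
    for _, subset in data.groupby(group_col, sort=False):
        # Columns with at least one observed value for this patient;
        # KNNImputer would drop the others by default.
        observed = subset.columns[subset.notna().any()]
        knn = KNNImputer(missing_values=np.nan, n_neighbors=n_neighbors)
        X = knn.fit_transform(subset[observed])
        imputed = pd.DataFrame(X, columns=observed, index=subset.index)
        # Restore the original column set; all-NaN columns stay NaN.
        pieces.append(imputed.reindex(columns=data.columns))
    return pd.concat(pieces)


# Hypothetical demo data: patient 1 has no protein_b measurements at all.
demo = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 2],
    "visit_month": [0, 6, 12, 0, 6, 12],
    "protein_a": [1.0, np.nan, 3.0, 10.0, 20.0, np.nan],
    "protein_b": [np.nan, np.nan, np.nan, 5.0, np.nan, 7.0],
})

result = impute_per_patient(demo)
print(result)
```

Note that the imputer still uses every column you pass in, including patient_id and visit_month, when it measures distances between rows. If only the measurement columns should drive the neighbor search, pass just those columns to fit_transform and join the identifiers back afterwards.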