I have a data set on police killings that you can find on Kaggle. There's some missing data in several columns:
UID                0.000000
Name               0.000000
Age                0.018653
Gender             0.000640
Race               0.317429
Date               0.000000
City               0.000320
State              0.000000
Manner_of_death    0.000000
Armed              0.454487
Mental_illness     0.000000
Flee               0.000000
dtype: float64
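(For reproducibility: these are per-column fractions of missing values, roughly what pf.isna().mean() would report; the CSV path below is only a placeholder.)
import pandas as pd

pf = pd.read_csv("police_killings.csv")  # placeholder path to the Kaggle CSV
print(pf.isna().mean())                  # fraction of missing values per column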
I created a copy of the original df to encode it and then impute the missing values. The copy, lepf, holds these categorical columns:
Index(['Gender', 'Race', 'City', 'State', 'Manner_of_death', 'Armed', 'Mental_illness', 'Flee'], dtype='object')
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
lpf = {}
for col in lepf.columns:
    lpf[col] = le.fit_transform(lepf[col])
lpfdf = pd.DataFrame(lpf)
Now I have my dataframe with all categories encoded.
for col in lpfdf:
    print(col, "\n", len(np.where(pf[col].to_frame().isna())[0]))
Gender 8
Race 3965
City 4
State 0
Manner_of_death 0
Armed 5677
Mental_illness 0
Flee 0
For instance, Gender got three encoded labels: 0 for Male, 1 for Female, and 2 for nan. However, the feature City had >3000 distinct values, so it was not possible to locate the nan label using value_counts(). For that reason, I used:
np.where(pf["City"].to_frame().isna())
Which yielded:
(array([ 4110, 9093, 10355, 10549], dtype=int64), array([0, 0, 0, 0], dtype=int64))
Looking at any of the rows at these indices, I saw that the nan label for City was 3327:
lpfdf.iloc[10549]
Gender 1
Race 6
City 3327
State 10
Manner_of_death 1
Armed 20
Mental_illness 0
Flee 0
Name: 10549, dtype: int64
Then I proceeded to replace these labels with np.nan:
"""
Gender: 2,
Race: 6,
City: 3327,
Armed: 59
"""
lpfdf["Gender"] = lpfdf["Gender"].replace(2, np.nan)
lpfdf["Race"] = lpfdf["Race"].replace(6, np.nan)
lpfdf["City"] = lpfdf["City"].replace(3327, np.nan)
lpfdf["Armed"] = lpfdf["Armed"].replace(59, np.nan)
from sklearn.experimental import enable_iterative_imputer  # needed: IterativeImputer is still experimental
from sklearn.impute import IterativeImputer
itimp = IterativeImputer()
iilpf = itimp.fit_transform(lpfdf)
Then I made a dataframe from the imputed values:
itimplpf = pd.DataFrame(np.round(iilpf), columns=lepf.columns)
And finally, when I go to inverse transform to see the corresponding labels it imputed, I get the following error:
for col in lpfdf:
    le.inverse_transform(itimplpf[col].astype(int))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-191-fbdde4bb4781> in <module>
1 for col in lpfdf:
----> 2 le.inverse_transform(itimplpf[col].astype(int))
~\anaconda3\lib\site-packages\sklearn\preprocessing\_label.py in inverse_transform(self, y)
158 diff = np.setdiff1d(y, np.arange(len(self.classes_)))
159 if len(diff):
--> 160 raise ValueError(
161 "y contains previously unseen labels: %s" % str(diff))
162 y = np.asarray(y)
ValueError: y contains previously unseen labels: [2 3 4 5]
What is wrong with my steps? Sorry for the long-winded explanation, but I felt I needed to walk through all the steps so that you can understand the issue properly. Thank you all.
A possibility that might be worth exploring is predicting the missing categorical (encoded) values with a machine learning model, e.g. sklearn.ensemble.RandomForestClassifier.
Here, you would train a multiclass classification model for predicting the missing values of each of your columns. You'd start by replacing missing values with a magic value (e.g. -99), and then one-hot encode them. Next, train a classification model to predict the categorical value of a chosen column, using the one-hot encoded values of the other columns as training data. The training data would, of course, exclude rows where the column to be predicted is missing. Finally, compose a "test" set made from the rows where this column is missing, predict the values, and impute these values into the column. Repeat this for each column that needs to have missing values imputed.
Assuming you want to apply machine learning techniques to this data at a later point, a deeper question is whether the absence of values in some examples of the dataset may in fact carry useful information for predicting your Target, and consequently, whether a particular imputation strategy could corrupt that information.
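One cheap way to preserve that signal, whichever imputation you settle on, is to add explicit missing-indicator columns before imputing. A minimal sketch on a toy frame (the column names just mirror your dataset):
import numpy as np
import pandas as pd

# toy frame standing in for the real data; Race and Armed contain missing values
df = pd.DataFrame({
    "Race": ["White", np.nan, "Black", np.nan],
    "Armed": ["Gun", "Knife", np.nan, "Gun"],
})

# record which cells were missing *before* any imputation,
# so a downstream model can still use "missingness" as a feature
for col in ["Race", "Armed"]:
    df[col + "_missing"] = df[col].isna().astype(int)

print(df)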
Edit: Below is an example of the per-column prediction approach, using dummy data.
import numpy as np
import sklearn
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
#from catboost import CatBoostClassifier
# create some fake data
n_samples = 1000
n_features = 20
features_og, _ = make_classification(n_samples=n_samples, n_features=n_features,
                                     n_informative=3, n_repeated=16, n_redundant=0)
# convert to fake categorical data
features_og = (features_og*10).astype(int)
# add missing value flag (-99) at random
features = features_og.copy()
for i in range(n_samples):
    for j in range(n_features):
        if np.random.random() > 0.85:
            features[i, j] = -99
# go column by column predicting and replacing missing values
features_fixed = features.copy()
for j in range(n_features):
    # do train/test split based on whether the selected column value is -99
    train = features[np.where(features[:, j] != -99)]
    test = features[np.where(features[:, j] == -99)]
    clf = RandomForestClassifier(n_estimators=300, max_depth=5, random_state=42)
    # potentially better for categorical features is CatBoost:
    # clf = CatBoostClassifier(n_estimators=300, cat_features=[identify categorical features here])
    # train the classifier to predict the value of column j using the other columns
    clf.fit(train[:, [x for x in range(n_features) if x != j]], train[:, j])
    # predict values for elements of column j that have the missing flag
    preds = clf.predict(test[:, [x for x in range(n_features) if x != j]])
    # substitute the missing values in column j with the predicted values
    features_fixed[features[:, j] == -99, j] = preds
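Since the dummy data keeps the unmasked originals around, a rough sanity check of the imputation quality is possible (continuing directly from the snippet above):
# rough check: how often do the imputed values match the original, pre-masking ones?
mask = features == -99
recovery = (features_fixed[mask] == features_og[mask]).mean()
print("exact recovery rate on masked cells:", round(recovery, 3))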