I am using OneHotEncoder to create a series of dummy variables based on a categoric variable. The problem I encounter is that any missing values are not transfered to the available dummy variables.
oh = OneHotEncoder(min_frequency = 0.0001, sparse_output = False) ### df_experience_level
data = oh.fit_transform(df[['experience_level']])
cols=oh.get_feature_names_out()
df_experience_level = pd.DataFrame(data,columns=cols)
missing_values_count = df_experience_level.isnull().sum()
missing_values_count
output:
experience_level_EN 0
experience_level_EX 0
experience_level_MI 0
experience_level_SE 0
experience_level_nan 0
dtype: int64
The current code that I use is:
df.loc[df['experience_level'].isna(), 'experience_level_EN'] = np.nan
df.loc[df['experience_level'].isna(), 'experience_level_EX'] = np.nan
df.loc[df['experience_level'].isna(), 'experience_level_MI'] = np.nan
df.loc[df['experience_level'].isna(), 'experience_level_SE'] = np.nan
However this is tedious.
Running the obvious:
df.loc[df['experience_level'].isna(), df_experience_level] = np.nan
Results in:
ValueError: Index data must be 1-dimensional
Is there any way to transfer the NaNs from the parent variable to each dummy variable in a single statement?
You can try with str.startswith
to select the correct column subset:
# Obviously replace exp_lvl with experience_level
subset = df_exp_lvl.columns.str.startswith('exp_lvl')
df_exp_lvl.loc[df['exp_lvl'].isna(), subset] = np.nan
Output:
>>> df_exp_lvl
exp_lvl_EN exp_lvl_EX exp_lvl_MI exp_lvl_SE exp_lvl_nan
0 1.0 0.0 0.0 0.0 0.0
1 0.0 0.0 1.0 0.0 0.0
2 1.0 0.0 0.0 0.0 0.0
3 1.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 1.0 0.0
5 NaN NaN NaN NaN NaN
6 0.0 1.0 0.0 0.0 0.0
7 0.0 1.0 0.0 0.0 0.0
8 NaN NaN NaN NaN NaN
9 0.0 0.0 0.0 1.0 0.0
10 0.0 0.0 1.0 0.0 0.0
11 0.0 0.0 1.0 0.0 0.0
12 1.0 0.0 0.0 0.0 0.0
13 NaN NaN NaN NaN NaN
14 0.0 1.0 0.0 0.0 0.0
>>> df
exp_lvl
0 EN
1 MI
2 EN
3 EN
4 SE
5 NaN
6 EX
7 EX
8 NaN
9 SE
10 MI
11 MI
12 EN
13 NaN
14 EX