In short: my columns differ between the train set and the test set after imputing and creating dummy variables.
Code for making the train and test datasets:
random_state_value = 0
#Define target
X = data.drop(columns = 'income', axis=1)
y = data['income']
#Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = random_state_value)
#Impute missing data
imputer_cat = SimpleImputer(strategy = 'most_frequent')
imputer_num = SimpleImputer(strategy = 'median')
X_train[['workclass', 'occupation', 'native-country']] = imputer_cat.fit_transform(X_train[['workclass', 'occupation', 'native-country']])
X_train[['age']] = imputer_num.fit_transform(X_train[['age']])
X_test[['workclass', 'occupation', 'native-country']] = imputer_cat.fit_transform(X_test[['workclass', 'occupation', 'native-country']])
X_test[['age']] = imputer_num.fit_transform(X_test[['age']])
#Create dummy vars
X_train = pd.get_dummies(X_train, columns=['workclass', 'education', 'marital-status',
'occupation', 'relationship', 'race', 'gender', 'native-country'], drop_first = True)
X_test = pd.get_dummies(X_test, columns=['workclass', 'education', 'marital-status',
'occupation', 'relationship', 'race', 'gender', 'native-country'], drop_first = True)
y_train = pd.get_dummies(y_train, columns='income', drop_first = True)
y_test = pd.get_dummies(y_test, columns='income', drop_first = True)
y_test = y_test.values.ravel()
y_train = y_train.values.ravel()
I had categorical variables which had missing values. This is what I have done:
1. split the data into train and test sets
2. impute the missing values in the train and test sets
3. create dummy variables for the categorical variables
But then some columns have disappeared, and the number of columns in X_train and X_test is different.
temp_test = X_test.columns.sort_values()
temp_train = X_train.columns.sort_values()
[col for col in temp_train if col not in temp_test]
These are the columns that appear in X_train but not in X_test.
Why does this happen? And how can I fix this problem?
You need to be careful when encoding categorical variables if you decide to use pd.get_dummies. After you apply it to your training data, it is not trivial to reproduce the same encoding on your test data. That is, if your training data had the categorical values ["female", "male"] in a gender column and get_dummies turned them into gender_female and gender_male columns, there is no guarantee it will produce the same columns when you run it on your test data.
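For instance, on two toy Series (not your data), the same call produces different columns depending on which values happen to be present:
pd.get_dummies(pd.Series(["male", "female"]))  # creates columns: female, male
pd.get_dummies(pd.Series(["male", "male"]))    # creates a single column: male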
This is because pd.get_dummies only creates columns for the categorical values it actually sees in the data it is given. If, coincidentally, your train/test split leaves only "male" values in the training set, your current code will create a gender_male column for that DataFrame, while the testing set will also get a gender_female column. Hence, the two end up with different columns. Note that I purposely avoided the drop_first=True conversation to make my point, but you might consider using it, as discussed heavily in this StackOverflow post.
This post, Keep same dummy variable in training and testing data, also goes over this topic in detail.
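A common quick fix along those lines is to align the test set's columns to the training set's after encoding, filling any dummy column that is missing from the test set with 0 (a minimal sketch, run after both get_dummies calls):
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)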
The following example demonstrates the problem with some made-up data (since we don't have access to yours):
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
# Generate sample data
np.random.seed(42)
num_rows = 1000
data = pd.DataFrame(
{
"occupation": np.random.choice(
["Tech-support", "Priv-house-serv", "Protective-serv", "Armed-Forces"], num_rows
),
"race": np.random.choice(["White", "Asian-Pac-Islander", "Amer-Indian-Eskimo", "Other", "Black"], num_rows),
"gender": np.random.choice(["Male", "Female"], num_rows),
"native-country": np.random.choice(["United-States", "Cambodia", "England", "Puerto-Rico"], num_rows),
"age": np.random.randint(18, 81, num_rows),
"income": np.where(np.random.rand(num_rows) < 0.25, ">50K", "<=50K"),
}
)
# Split the data into train and test sets
X = data.drop(columns="income")
y = data["income"]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Simulate a scenario in which "Other" is coincidentally missing from
# the test set
X_test[X_test["race"] == "Other"] = np.nan
# Create dummy variables on the train and test sets separately
X_train = pd.get_dummies(
X_train,
columns=["occupation", "race", "gender", "native-country"],
drop_first=False,
)
X_test = pd.get_dummies(
X_test,
columns=["occupation", "race", "gender", "native-country"],
drop_first=False,
)
print("X_train columns:", X_train.columns)
print("X_test columns:", X_test.columns)
# Test if the columns are the same
assert X_train.columns.equals(X_test.columns) # This will fail!
What you should do instead is apply pd.get_dummies to your whole dataset first and then do the train/test split! That avoids the mismatched-columns issue entirely. That is,
# Generate sample data
data = ...
# Split the data into train and test sets
X = data.drop(columns="income")
y = data["income"]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Simulate a scenario in which "Other" is coincidentally missing from the test set
X_test[X_test["race"] == "Other"] = np.nan
# Combine the train and test sets before creating dummy variables
# (Note that we can just do this on X first, then train/test split, but it helps me
# make my point regarding categorical values missing in X_test)
X_all = pd.concat([X_train, X_test], ignore_index=True)
X_all = pd.get_dummies(
X_all,
columns=["occupation", "race", "gender", "native-country"],
drop_first=False,
)
# Split the data back into train and test sets. X_all preserves the
# train-then-test row order from the concat, so split it positionally
# (re-running train_test_split here would misalign X_all with y)
X_train = X_all.iloc[: len(y_train)]
X_test = X_all.iloc[len(y_train) :]
print("X_train columns:", X_train.columns)
print("X_test columns:", X_test.columns)
# Test if the columns are the same
assert X_train.columns.equals(X_test.columns) # This won't fail!
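Alternatively, if you would rather keep encoding the two sets separately, you can first cast the categorical columns to a pandas categorical dtype with a fixed category list; pd.get_dummies then emits a column for every declared category, whether or not it occurs in that particular DataFrame. A sketch, assuming you know the full category list up front:
race_type = pd.CategoricalDtype(
    categories=["White", "Asian-Pac-Islander", "Amer-Indian-Eskimo", "Other", "Black"]
)
X_train["race"] = X_train["race"].astype(race_type)
X_test["race"] = X_test["race"].astype(race_type)
# get_dummies now creates race_* columns for all five categories in both
# DataFrames, even for categories that never occur in one of them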
Finally, I recommend using scikit-learn's OneHotEncoder. Your code would then look something like the following, with the added benefit that the fitted encoder remembers exactly how it encoded the categorical values, so you can reuse that encoding later:
from sklearn.preprocessing import OneHotEncoder

# Simulate a scenario in which "Other" is coincidentally missing from the test set
# This time doing it on the original dataset, X
X[X["race"] == "Other"] = np.nan
# Create dummy variables using OneHotEncoder, but only for the categorical
# columns, so that numeric columns such as "age" are left untouched
cat_cols = ["occupation", "race", "gender", "native-country"]
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
encoded = encoder.fit_transform(X[cat_cols])
# Convert the encoded data to a DataFrame and reattach the numeric columns
X = pd.concat(
    [
        X.drop(columns=cat_cols),
        pd.DataFrame(encoded, columns=encoder.get_feature_names_out(cat_cols), index=X.index),
    ],
    axis=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("X_train columns:", X_train.columns)
print("X_test columns:", X_test.columns)
# Test if the columns are the same
# This also won't fail!
assert X_train.columns.equals(X_test.columns)
You can read up more about it here.
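In practice, you would fit the encoder on the training set only and reuse it on the test set; with handle_unknown="ignore", a category the encoder never saw during fitting is simply encoded as all zeros instead of producing a new column. Here is a sketch that also folds your imputation step into the same preprocessor (assuming the column names from your question):
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

categorical = ["workclass", "occupation", "native-country"]
numerical = ["age"]
preprocess = ColumnTransformer(
    [
        # Impute categoricals with the mode, then one-hot encode them
        (
            "cat",
            make_pipeline(
                SimpleImputer(strategy="most_frequent"),
                OneHotEncoder(handle_unknown="ignore", sparse_output=False),
            ),
            categorical,
        ),
        # Impute numericals with the median
        ("num", SimpleImputer(strategy="median"), numerical),
    ],
    remainder="passthrough",  # leave any remaining columns untouched
)
# Fit on the training data only; the test set is transformed with the
# statistics and categories learned from the training data
X_train_transformed = preprocess.fit_transform(X_train)
X_test_transformed = preprocess.transform(X_test)
Because the imputers and the encoder are fitted only on X_train, this also removes the subtle leakage in your current code, where fit_transform is called on the test set as well.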