Tags: python, machine-learning, scikit-learn

Columns are missing after imputing and creating dummy variables. How should I fix this?


In short: my columns differ between the train set and the test set after imputing and creating dummy variables.

Code for creating the train and test datasets:

random_state_value = 0

#Define target
X = data.drop(columns = 'income', axis=1)
y = data['income']

#Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = random_state_value)
#Impute missing data
imputer_cat = SimpleImputer(strategy = 'most_frequent')
imputer_num = SimpleImputer(strategy = 'median')

X_train[['workclass', 'occupation', 'native-country']] = imputer_cat.fit_transform(X_train[['workclass', 'occupation', 'native-country']])
X_train[['age']] = imputer_num.fit_transform(X_train[['age']])

X_test[['workclass', 'occupation', 'native-country']] = imputer_cat.fit_transform(X_test[['workclass', 'occupation', 'native-country']])
X_test[['age']] = imputer_num.fit_transform(X_test[['age']])
#Create dummy vars
X_train = pd.get_dummies(X_train, columns=['workclass', 'education', 'marital-status', 
                                     'occupation', 'relationship', 'race', 'gender', 'native-country'], drop_first = True)
X_test = pd.get_dummies(X_test, columns=['workclass', 'education', 'marital-status', 
                                     'occupation', 'relationship', 'race', 'gender', 'native-country'], drop_first = True)

y_train = pd.get_dummies(y_train, columns='income', drop_first = True)
y_test = pd.get_dummies(y_test, columns='income', drop_first = True)
y_test = y_test.values.ravel()
y_train = y_train.values.ravel()

I have categorical variables with missing values. This is what I did:

1. split the data into train and test sets

2. impute missing values in the train and test sets

3. create dummy variables for the categorical variables

But then some columns have disappeared, and the number of columns in X_train and X_test no longer matches.

To find which columns were lost, I compared the two sets of columns:

temp_test = X_test.columns.sort_values()
temp_train = X_train.columns.sort_values()

[col for col in temp_train if col not in temp_test]

This lists the columns that exist in X_train but not in X_test.

Why does this happen? And how can I fix this problem?


Solution

  • You need to be careful when encoding categorical variables with pd.get_dummies: after you use it on your training data, it is not trivial to reproduce the same encoding on your test data. That is, if your training data had categorical values such as ["female", "male"] in a gender column and it encoded them as [0, 1] respectively, there is no guarantee that it will do the same if you run it separately on your test data with the same categorical values.

    On the other hand, pd.get_dummies only creates columns for the categorical values that actually appear in the DataFrame you give it, for example ["gender_male", "gender_female"]. If, coincidentally, after making your train/test split, the training set only has "male" values, then your current code will create a "gender_male" column for that DataFrame, while the testing set (which still contains "female") will also get a "gender_female" column. Hence, the two end up with different columns. Note that I purposely avoided the drop_first=True conversation to make my point, but you might consider using it, as discussed heavily in this StackOverflow post and illustrated briefly below.
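
    To see what drop_first=True does, here is a quick toy illustration (made-up data, not from the question):

    import pandas as pd
    
    df = pd.DataFrame({"gender": ["Male", "Female", "Male"]})
    
    # Default: one indicator column per category
    print(pd.get_dummies(df, columns=["gender"]).columns.tolist())
    # ['gender_Female', 'gender_Male']
    
    # drop_first=True drops the first category (in sorted order), keeping
    # n - 1 indicator columns and removing the redundant one
    print(pd.get_dummies(df, columns=["gender"], drop_first=True).columns.tolist())
    # ['gender_Male']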

    The post Keep same dummy variable in training and testing data also goes over this topic in detail.

    The following example demonstrates this with some made-up data (since we don't have access to yours):

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import train_test_split
    
    # Generate sample data
    np.random.seed(42)
    num_rows = 1000
    
    data = pd.DataFrame(
        {
            "occupation": np.random.choice(
                ["Tech-support", "Priv-house-serv", "Protective-serv", "Armed-Forces"], num_rows
            ),
            "race": np.random.choice(["White", "Asian-Pac-Islander", "Amer-Indian-Eskimo", "Other", "Black"], num_rows),
            "gender": np.random.choice(["Male", "Female"], num_rows),
            "native-country": np.random.choice(["United-States", "Cambodia", "England", "Puerto-Rico"], num_rows),
            "age": np.random.randint(18, 81, num_rows),
            "income": np.where(np.random.rand(num_rows) < 0.25, ">50K", "<=50K"),
        }
    )
    
    # Split the data into train and test sets
    X = data.drop(columns="income")
    y = data["income"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    
    # Simulate a scenario in which "Other" is coincidentally missing from 
    # the test set
    X_test[X_test["race"] == "Other"] = np.nan
    
    # Create dummy variables on the train and test sets separately
    X_train = pd.get_dummies(
        X_train,
        columns=["occupation", "race", "gender", "native-country"],
        drop_first=False,
    )
    X_test = pd.get_dummies(
        X_test,
        columns=["occupation", "race", "gender", "native-country"],
        drop_first=False,
    )
    
    print("X_train columns:", X_train.columns)
    print("X_test columns:", X_test.columns)
    
    # Test if the columns are the same
    assert X_train.columns.equals(X_test.columns)  # This will fail!
    

    What you should do instead is run pd.get_dummies on your whole dataset first and only then do the train/test split!

    This avoids any issues with mismatched columns. That is,

    # Generate sample data
    data = ...
    
    # Split the data into train and test sets
    X = data.drop(columns="income")
    y = data["income"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    
    # Simulate a scenario in which "Other" is coincidentally missing from the test set
    X_test[X_test["race"] == "Other"] = np.nan
    
    # Combine the train and test sets before creating dummy variables
    # (Note that we could just do this on X first and then train/test split, but
    #  recombining here helps me make my point about categorical values missing
    #  from X_test)
    X_all = pd.concat([X_train, X_test], ignore_index=True)
    # Recombine y in the same row order so that X_all and y_all stay aligned
    y_all = pd.concat([y_train, y_test], ignore_index=True)
    X_all = pd.get_dummies(
        X_all,
        columns=["occupation", "race", "gender", "native-country"],
        drop_first=False,
    )
    
    # Split the data back into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X_all, y_all, test_size=0.2, random_state=42
    )
    
    print("X_train columns:", X_train.columns)
    print("X_test columns:", X_test.columns)
    
    # Test if the columns are the same
    assert X_train.columns.equals(X_test.columns)  # This won't fail!
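
    As an aside (my own addition, not part of the approach above): if you would rather keep encoding each split separately, another common pattern is to align the test columns to the train columns afterwards with DataFrame.reindex, so that dummies missing from the test set become all-zero columns:

    # Encode each split separately, then force X_test to have exactly
    # X_train's columns: dummies absent from the test set are filled with 0,
    # and any test-only dummies are dropped
    X_train = pd.get_dummies(X_train)
    X_test = pd.get_dummies(X_test)
    X_test = X_test.reindex(columns=X_train.columns, fill_value=0)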
    

    Finally, I recommend using scikit-learn's OneHotEncoder; your code would then look something like this, with the added benefit that the fitted encoder remembers exactly how it encoded the categorical values:

    from sklearn.preprocessing import OneHotEncoder
    
    # Simulate a scenario in which "Other" is coincidentally missing from the test set
    # This time doing it on the original dataset, X
    X[X["race"] == "Other"] = np.nan
    original_columns = X.columns
    
    # Create dummy variables using OneHotEncoder
    # (for simplicity every column is encoded here; in practice you would
    #  encode only the categorical columns)
    encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
    X = encoder.fit_transform(X)
    
    # Convert the encoded data back to a DataFrame
    X = pd.DataFrame(X, columns=encoder.get_feature_names_out(original_columns))
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    
    print("X_train columns:", X_train.columns)
    print("X_test columns:", X_test.columns)
    
    # Test if the columns are the same
    # This also won't fail!
    assert X_train.columns.equals(X_test.columns)
    

    You can read more about it in the scikit-learn documentation for OneHotEncoder.
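
    For completeness, here is a minimal sketch (my own, not from the original answer) of how the imputation and encoding steps from the question could be combined into a single preprocessor that is fit on the training set only, which guarantees the test set ends up with exactly the same columns. The column lists are the ones from the question's code:

    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder
    
    # Columns taken from the question's code
    cat_with_missing = ["workclass", "occupation", "native-country"]
    cat_complete = ["education", "marital-status", "relationship", "race", "gender"]
    num_with_missing = ["age"]
    
    preprocessor = ColumnTransformer(
        transformers=[
            # Impute the categoricals that have missing values, then encode them
            ("cat_imp", Pipeline([
                ("impute", SimpleImputer(strategy="most_frequent")),
                ("encode", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
            ]), cat_with_missing),
            # One-hot encode the remaining categoricals directly
            ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), cat_complete),
            # Impute the numeric column with the median
            ("num", SimpleImputer(strategy="median"), num_with_missing),
        ],
        remainder="passthrough",
    )
    
    # Fit on the training set only, then reuse the same fitted preprocessor on
    # the test set: both outputs are guaranteed to have identical columns
    X_train_prep = preprocessor.fit_transform(X_train)
    X_test_prep = preprocessor.transform(X_test)

    Because the encoder is fitted on X_train alone, handle_unknown="ignore" makes any category seen only in the test set encode to all zeros instead of creating a new column.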