python, machine-learning, scikit-learn, encoding

Leave one out encoding on test set with transform


Context: When preprocessing a data set with sklearn, you call fit_transform on the training set and transform on the test set to avoid data leakage. With leave-one-out (LOO) encoding, the target value is needed to compute the encoded value of a categorical feature. When the LOO encoder is used in a pipeline, you can apply it to the training set with fit_transform, which accepts both the features (X) and the target values (y).

How do I compute the LOO encodings for the test set with the same pipeline, given that transform does not accept the target values as an argument? I'm quite confused about this: transform does encode the columns, but without considering the target, since it never receives that information.
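
For concreteness, here is roughly the kind of pipeline I have in mind (the toy data and the logistic regression are just placeholders):

    import pandas as pd
    import category_encoders as ce
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    
    # Toy data just to make the example runnable
    X_train = pd.DataFrame({'f1': ['P', 'Q', 'P', 'Q'], 'f2': ['M', 'N', 'M', 'N']})
    y_train = pd.Series([1, 0, 1, 0])
    X_test = pd.DataFrame({'f1': ['P', 'Q'], 'f2': ['N', 'M']})
    
    pipeline = Pipeline([
        ('encode', ce.LeaveOneOutEncoder(cols=['f1', 'f2'])),
        ('model', LogisticRegression()),
    ])
    
    # fit forwards y to every step, so the encoder sees the target on the training set
    pipeline.fit(X_train, y_train)
    
    # predict only calls transform(X_test) on the encoder -- no target is passed here
    predictions = pipeline.predict(X_test)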


Solution

  • You shouldn't need the target variable of the test set when applying leave-one-out (or any other target-based) encoding. Even if you somehow managed to pass it for your offline evaluation on the test set, how would you apply the encoding at inference time? When your model is serving traffic from real users, the true label obviously isn't available, and your test metrics should always be computed so that they are representative of what happens in the real world. So, conceptually, it is wrong to use the test labels for feature encoding.

    I looked up the source code of leave-one-out encoding in the category_encoders package, and it is apparent that when the target variable is not supplied, each level is replaced with its mean target without leaving the current example out:

    # Replace level with its mean target; if level occurs only once, use global mean
    level_means = (colmap['sum'] / colmap['count']).where(level_notunique, self._mean)
    

    So if I just use the encoder like this

    import category_encoders as ce
    from sklearn.model_selection import train_test_split
    import pandas as pd
    
    dataframe = pd.DataFrame({
        'f1': ['P', 'Q', 'P', 'Q', 'P', 'P', 'Q', 'Q'],
        'f2': ['M', 'N', 'M', 'N', 'M', 'N', 'M', 'N'],
        'f3': ['A', 'B', 'C', 'C', 'C', 'C', 'A', 'C'],
        'y': [1, 0, 1, 0, 1, 1, 0, 0]
    })
    
    train_data, test_data = train_test_split(dataframe, test_size=0.2)
    
    encoder = ce.LeaveOneOutEncoder(cols=['f1', 'f2', 'f3'])
    
    # fit_transform uses y: each training row gets the mean target of its level,
    # computed with that row itself left out
    encoded_train = encoder.fit_transform(train_data, train_data['y'])
    
    # transform has no y: each test row simply gets the mean target of its level
    # as computed on the training data
    encoded_test = encoder.transform(test_data)
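
    then encoded_test depends only on the training data. As a quick sanity check (assuming the default handle_unknown and handle_missing settings, and reusing the variables above), the test encodings should match the per-level target means from the training set, with the global training mean as the fallback for levels that are unseen or occur only once:

    import numpy as np
    
    # For a level seen more than once in training, the test encoding should be that
    # level's mean target in the training data; levels seen only once, unseen, or
    # missing fall back to the global training mean (self._mean in the snippet above)
    global_mean = train_data['y'].mean()
    
    for col in ['f1', 'f2', 'f3']:
        stats = train_data.groupby(col)['y'].agg(['mean', 'count'])
        level_means = stats['mean'].where(stats['count'] > 1, global_mean)
        expected = test_data[col].map(level_means).fillna(global_mean)
        print(col, np.allclose(encoded_test[col], expected))  # should print True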