[SOLVED] Data Leakage during categorical variable handling?

I am fairly new to machine learning. I came across the concept of Data Leakage. The article I read says to always split the data before performing preprocessing steps.

My question is, do steps such as discretization, grouping categories to a single category to reduce cardinality, converting categorical variables to binary variables, etc. lead to Data Leakage?

Should I split the data into train and test sets before applying these steps?

Also, which are the main preprocessing steps I really need to be cautious of in order to avoid data leakage?

Solution

This is a very interesting topic, and I'll try to keep it simple and brief. Data leakage is the state where an ML model is trained on predictors that will not be available at prediction time in real or production environments. If you apply a pre-processing step to your training data, you should apply the same step to your testing data in order to make predictions, but this does not cause data leakage, as long as anything the step learns from the data (bin edges, which categories to group, and so on) is learned from the training set only. Some ML libraries handle this for you, like R's recipes package from tidymodels.
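As a minimal sketch of that "same step on train and test" idea, here is the category grouping from the question written in plain Python. The function names `fit_rare_grouper` and `apply_rare_grouper`, the `min_count` threshold, and the toy data are all made up for illustration; this is not how recipes or any other library implements it.

```python
from collections import Counter


def fit_rare_grouper(train_values, min_count=2):
    """Learn, from training data only, which categories are frequent enough to keep."""
    counts = Counter(train_values)
    return {category for category, n in counts.items() if n >= min_count}


def apply_rare_grouper(values, keep, other="other"):
    """Map every category not in `keep` (rare or never seen) to one `other` level."""
    return [v if v in keep else other for v in values]


# Toy data (hypothetical): "green" is rare in train, "purple" never appears in train.
train = ["red", "red", "blue", "blue", "green"]
test = ["red", "green", "purple"]

keep = fit_rare_grouper(train, min_count=2)    # learned from the training set only
train_grouped = apply_rare_grouper(train, keep)
test_grouped = apply_rare_grouper(test, keep)  # the same fitted step, reused on test

print(train_grouped)
print(test_grouped)
```

The point of splitting the step into a "fit" part and an "apply" part is that the test set never influences which categories get grouped; it only receives the mapping learned from the training set.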

To answer your question: the place to worry about data leakage is not the pre-processing and feature engineering itself but well before it, when you define your problem and choose the data you will use to train the model that attempts to solve it. Here is an example I have faced repeatedly in practice:

Suppose you are fitting an ML model to predict sales of some product in your company, one year into the future. To achieve this, you use the product's historical data along with data on complementary and substitute goods. The model's training and testing performance is great and you plan to move the model to production, but you have a huge problem: the complementary and substitute product data will not be available until that year has passed. By then, making a one-year-horizon prediction makes no sense, because you will already have observed the sales data.

In conclusion, this case of data leakage can be prevented by forecasting your independent variables, or by using time series models that need no variables other than the response. This is only one case of data leakage, but you can find more in Max Kuhn and Kjell Johnson's great book "Feature Engineering and Selection: A Practical Approach for Predictive Models".
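To make the second option concrete, here is a rough sketch of a forecast that uses nothing but the response's own history. I picked a seasonal naive baseline (repeat the last observed season) only because it is the simplest model of that kind, not because it is the right choice for real sales data. The function name, the `season_length` default, and the monthly figures are invented for the example.

```python
def seasonal_naive_forecast(history, horizon, season_length=12):
    """Forecast `horizon` steps ahead by repeating the last observed season.

    Only past values of the response are used, so every input is
    already known at the moment the forecast is made.
    """
    if len(history) < season_length:
        raise ValueError("need at least one full season of history")
    last_season = history[-season_length:]
    return [last_season[step % season_length] for step in range(horizon)]


# Hypothetical monthly sales for the most recent year.
monthly_sales = [10, 12, 15, 11, 9, 14, 20, 22, 18, 13, 12, 25]

# One-year-ahead forecast built from the product's own history only.
forecast = seasonal_naive_forecast(monthly_sales, horizon=12)
print(forecast)
```

Because the forecast never touches the complementary or substitute product data, there is nothing in it that would be missing in production.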