r, machine-learning, random-forest

Struggling to understand the complete predictive modelling process in R


I am trying to predict customer churn using a database of current and already churned customers. So far I have

  1. Taken the complete customer database of current and already churned customers, along with customer service variables etc., to use as predictors.
  2. Split the data set randomly 70/30 into train and test.
  3. Using R, trained a random forest model on the training set.
  4. Run that model on the test data and compared its predictions to the actual status using a confusion matrix, to check its accuracy at identifying churners.
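
For reference, the four steps above might look roughly like this in R. This is only a sketch: the data frame `customers`, its columns, and all of its values are synthetic and made up for illustration, and it assumes the `randomForest` package is installed.

```r
library(randomForest)

set.seed(42)

# Hypothetical customer table; every column and value here is synthetic.
n <- 500
customers <- data.frame(
  customer_id   = seq_len(n),
  support_calls = rpois(n, lambda = 2),
  tenure_months = sample(1:60, n, replace = TRUE)
)

# Synthetic label: more support calls and shorter tenure raise the churn odds.
churn_prob <- plogis(-1 + 0.6 * customers$support_calls -
                       0.04 * customers$tenure_months)
customers$churned <- factor(ifelse(runif(n) < churn_prob, "yes", "no"),
                            levels = c("no", "yes"))

# Step 2: random 70/30 split into train and test.
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- customers[train_idx, ]
test  <- customers[-train_idx, ]

# Step 3: fit the forest on the training rows only
# (customer_id is an identifier, not a predictor).
fit <- randomForest(churned ~ support_calls + tenure_months,
                    data = train, ntree = 200)

# Step 4: predict on the held-out rows and compare with the actual status.
pred <- predict(fit, newdata = test)
conf_mat <- table(actual = test$churned, predicted = pred)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
```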

What I want to do now is take all of our current customers and predict which ones will churn. Have I done this all wrong, given that a lot of the current customers I need predictions for have already been seen by the model because they appear in the training set?

Was I somehow supposed to use a training and test set that will not be part of the dataset I need to make predictions on?


Solution

  • As I understand it, you are asking whether it is a problem that some of the customers you now want to predict on were already in your training set. The key rule is that the training set must be kept separate from the test set. The model's parameters are fitted to the training examples, so for any example it has already seen it will tend to reproduce the known label. Those overlapping examples inflate your measured accuracy, so that figure no longer tells you how the model performs on new data. To evaluate the algorithm properly, the test set should contain only previously unseen examples. If some of your current customers (the ones you want to test the model on) are already in the training set, leave them out of the testing process. I'd suggest comparing the training-set customers with the current customers on a unique identifier, such as the customer ID if you have one, and dropping the customers common to both from your fresh batch of unseen test examples.
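
The ID check suggested above could be done in base R along these lines. This is a minimal sketch, not your actual data: `train_ids`, `current_customers`, and the column names are hypothetical stand-ins for whatever your own tables contain.

```r
# Hypothetical IDs of the customers that were used to train the model.
train_ids <- c(101, 102, 103, 104, 105)

# Hypothetical table of current customers to score.
current_customers <- data.frame(
  customer_id   = c(103, 105, 106, 107, 108),
  support_calls = c(1, 4, 0, 2, 5)
)

# Flag current customers the model already saw during training.
seen_in_training <- current_customers$customer_id %in% train_ids

# Keep only the customers the model has never seen.
unseen_customers <- current_customers[!seen_in_training, ]
```

Here `unseen_customers` keeps the rows for IDs 106, 107 and 108, since 103 and 105 also appear in `train_ids`.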