python, scikit-learn, model-fitting

Do scikit-learn classifiers and regressors cache training data?


I have some 22,000 rows of training data. I use train_test_split to get training and testing sets, run the fitting, and then get some idea of how well the fitting went using various methods of estimation.

I want to have the fitted model go back over all 22,000 rows and predict against them as if it had never seen them before. However, when I do this, the regressors and classifiers get every single row 100% correct, which cannot be right given that the best I can realistically expect is around 75%.
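Roughly, my workflow looks like this (a simplified sketch: the classifier and the synthetic stand-in dataset here are placeholders for my real setup):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in for my real ~22,000-row dataset
X, y = make_classification(n_samples=22000, n_features=20, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))  # plausible accuracy on the held-out rows

# Now predict against all 22,000 rows, including the ones used for fitting
print(clf.score(X, y))            # close to 1.0 -- every row "correct"
```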

Do the estimators have some sort of learning data cache? How can I delete the cache but keep the trained model?


Solution

  • go back over the 22,000 rows and predict against them as if it had never seen them before

    This is not possible. It HAS seen them during training, and it was optimized to fit the presented data as well as possible.

    There is no magical cache, but the model's learned parameters were derived from what it saw during training. In the worst case you have an overfitted model that achieves 100% accuracy on the training data without generalising at all, because it had enough parameters (or your dataset too little variation) that it simply learned to reproduce the training data exactly; a short demonstration follows at the end of this answer.

    See scikit-learn's example page on the same topic, which also has a plot for demonstration:

    Overfitting example from scikit-learn: https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html

    Note how the model fitted to the data points in the last panel has accurately learned to represent the training data, but is far from a meaningful representation of the actual underlying true function.
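
    To make this concrete, here is a minimal runnable sketch (my own construction, not taken from the question: synthetic data with injected label noise and an unconstrained decision tree) of a model that scores perfectly on the rows it was fitted on while doing noticeably worse on rows it has never seen:

    ```python
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in data; flip_y injects label noise so that a
    # perfect score on genuinely unseen rows is impossible by construction.
    X, y = make_classification(n_samples=22000, n_features=20,
                               flip_y=0.2, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # An unconstrained tree has enough capacity to memorize every training row.
    tree = DecisionTreeClassifier(random_state=0)  # no max_depth limit
    tree.fit(X_train, y_train)

    print(tree.score(X_train, y_train))  # 1.0 -- memorization, not a cache
    print(tree.score(X_test, y_test))    # clearly lower on unseen rows
    ```

    The only meaningful quality estimate therefore comes from data the model was not fitted on, i.e. the held-out test split (or cross-validation), never from re-predicting the training rows.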