Here is basic code for training a model in TPOT:
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Load the digits dataset and hold out 25% as a test set
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target,
    train_size=0.75, test_size=0.25, random_state=42)

# Run TPOT's evolutionary search for a good pipeline
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)

# Score the fitted pipeline on the held-out test set
print(tpot.score(X_test, y_test))
At the end, it scores the model on the test set without explicitly applying the transformations that were done on the training set. A few questions here.
Please educate me if I'm completely misunderstanding this. Thank you.
Does the "tpot" model object automatically apply any scaling or other transformations when .score or .predict is called on new out-of-sample data?
That depends on the final pipeline that TPOT chose. If that pipeline includes any data scaling or transformation steps, then it correctly applies those operations in the predict and score functions as well.
This is because, under the hood, TPOT optimizes scikit-learn Pipeline objects.
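You can see this for yourself after fitting. Here is a minimal sketch (continuing from the code above, and assuming a reasonably recent TPOT version, which exposes the best pipeline found via the fitted_pipeline_ attribute):

# The best pipeline TPOT found is an ordinary, already-fitted
# scikit-learn Pipeline object:
print(tpot.fitted_pipeline_)

# Because it is a Pipeline, calling .predict() re-applies every
# preprocessing step (as fitted on the training data) before predicting:
preds = tpot.fitted_pipeline_.predict(X_test)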
That said, if there are specific transformations that you want to guarantee are applied to your data, you have a couple of options:
You can split your data into training and test sets, learn the transformation (e.g., StandardScaler) on the training set, and then apply it to your test set as well. You would do both of these operations before ever passing the data to TPOT (see the first sketch after this list).
You can make use of TPOT's template functionality, which lets you specify constraints on what the analysis pipeline should look like (see the second sketch after this list).
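Here is a minimal sketch of the first option, reusing the variable names from the example above. The key point is that the scaler is fitted on the training set only and then reused, unfitted, on the test set:

from sklearn.preprocessing import StandardScaler

# Learn the scaling parameters on the training set only,
# then apply the same (already-fitted) transformation to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# TPOT now only ever sees the pre-scaled data
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train_scaled, y_train)
print(tpot.score(X_test_scaled, y_test))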
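And a minimal sketch of the second option, assuming TPOT 0.11 or later (which introduced the template parameter; check the TPOT docs for the exact step names your version supports):

# Constrain every candidate pipeline to the shape:
# feature transformer -> classifier
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2,
                      random_state=42,
                      template='Transformer-Classifier')
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))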