python machine-learning scikit-learn tpot

When fitting with TPOT CV, is the fitted_pipeline_ retrained on the whole dataset?


I am using a LeaveOneGroupOut CV strategy with TPOTRegressor:

from tpot import TPOTRegressor
from sklearn.model_selection import LeaveOneGroupOut

tpot = TPOTRegressor(
    config_dict=regressor_config_dict,
    generations=100,
    population_size=100,
    cv=LeaveOneGroupOut(),
    verbosity=2,
    n_jobs=1)

tpot.fit(XX, yy, groups=groups)

After optimization, the best-scoring trained pipeline is stored in tpot.fitted_pipeline_, and tpot.fitted_pipeline_.predict(X) is available.

My question is: what will the fitted pipeline have been trained on? For example, is it retrained on the whole dataset?

Additionally, is there a way to access the complete set of trained models corresponding to the set of splits for the winning/optimized pipeline?


Solution

  • TPOT will fit the final 'best' pipeline on the full training set, i.e. on all of the data passed to fit.

    It's therefore recommended that your testing data never be passed to the TPOT fit function if you plan to directly interact with the 'best' pipeline via the TPOT object.

    If that is an issue for you, you can retrain the pipeline directly via the tpot.fitted_pipeline_ attribute, which is simply a sklearn Pipeline object. Alternatively, you can use the export function to export the 'best' pipeline to its corresponding Python code and interact with the pipeline outside of TPOT.

    "Additionally, is there a way to access the complete set of trained models corresponding to the set of splits for the winning/optimized pipeline?"

    No. TPOT uses sklearn's cross_val_score when evaluating pipelines, so it throws out the set of trained pipelines from the CV process. However, you can access the scoring results of every pipeline that TPOT evaluated via the tpot.evaluated_individuals_ attribute.
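
Both points above can be sketched with plain scikit-learn, since tpot.fitted_pipeline_ is an ordinary sklearn Pipeline. The snippet below is a minimal illustration, not TPOT's own code. The data is synthetic, and best_pipeline is a hypothetical stand-in (a scaler plus Ridge) for tpot.fitted_pipeline_ so that the example runs without a TPOT search. It first retrains a clone of the pipeline on a chosen subset, then reruns the group-wise CV with return_estimator=True to rebuild the per-split models.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneGroupOut, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for XX, yy and groups from the question.
rng = np.random.default_rng(0)
XX = rng.normal(size=(60, 4))
yy = XX @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=60)
groups = np.repeat(np.arange(5), 12)  # 5 groups of 12 samples each

# Hypothetical stand-in for tpot.fitted_pipeline_. After a real TPOT run,
# use instead:  best_pipeline = tpot.fitted_pipeline_
best_pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(XX, yy)

# 1) Retrain the same pipeline on data of your choosing. clone() copies the
#    hyperparameters but drops the fitted state, so best_pipeline is untouched.
train_mask = groups != 4  # hold out group 4 as a test set
refit_pipeline = clone(best_pipeline).fit(XX[train_mask], yy[train_mask])
held_out_r2 = refit_pipeline.score(XX[~train_mask], yy[~train_mask])

# 2) Rebuild the per-split models that are discarded during the search:
#    rerun the CV and keep one fitted pipeline per split.
cv_results = cross_validate(
    clone(best_pipeline), XX, yy,
    groups=groups,
    cv=LeaveOneGroupOut(),
    return_estimator=True,
)
split_models = cv_results["estimator"]   # one fitted pipeline per left-out group
split_scores = cv_results["test_score"]  # default scorer here: the pipeline's R^2
```

Each entry of split_models was trained with one group left out, in the order in which LeaveOneGroupOut yields the splits. To compare these scores with the ones TPOT reported, pass the same scoring argument to cross_validate that was used for the TPOT run.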