python, python-3.x, machine-learning, tpot

Explanation of pipeline generated by tpot


I was using TPOTClassifier() and it returned the following as my optimal pipeline. I am attaching the exported pipeline code below. Can someone explain the pipeline's steps and the order in which they run?

import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFwe, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator
from sklearn.preprocessing import FunctionTransformer
from copy import copy

tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['target'], random_state=None)

exported_pipeline = make_pipeline(
    make_union(
        FunctionTransformer(copy),
        make_union(
            FunctionTransformer(copy),
            make_union(
                FunctionTransformer(copy),
                make_union(
                    FunctionTransformer(copy),
                    FunctionTransformer(copy)
                )
            )
        )
    ),
    SelectFwe(score_func=f_classif, alpha=0.049),
    ExtraTreesClassifier(bootstrap=False, criterion="entropy", max_features=1.0,
                         min_samples_leaf=2, min_samples_split=5, n_estimators=100)
)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

Solution

  • make_union concatenates the outputs of several transformers side by side, and FunctionTransformer(copy) passes the features through unchanged (it simply applies copy to the input). So the nested make_unions over FunctionTransformer(copy) do nothing but stack five copies of every feature next to one another, as the first sketch below demonstrates. That seems very odd, except that with ExtraTreesClassifier it has an effect akin to "bootstrapping" the feature selections. See also TPOT Issue 581 for an explanation of why these are generated in the first place; basically, adding copies is useful in stacking ensembles, and the genetic algorithm TPOT uses needs to generate the copies before it can explore such ensembles. The recommendation there is that running more generations of the genetic algorithm tends to clean up such artifacts.

    After that, things are straightforward: SelectFwe performs univariate feature selection, keeping only the features whose f_classif (ANOVA F-test) p-values remain significant after controlling the family-wise error rate at alpha=0.049, and an extremely randomized trees classifier is fit on whatever survives (second sketch below).
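
To make the duplication concrete, here is a minimal sketch (toy array, not the asker's data) that applies the same nested union to two features and gets back ten columns, five copies of each:

import numpy as np
from copy import copy
from sklearn.pipeline import make_union
from sklearn.preprocessing import FunctionTransformer

# Same nested structure as the exported pipeline: four unions,
# five FunctionTransformer(copy) leaves in total.
union = make_union(
    FunctionTransformer(copy),
    make_union(
        FunctionTransformer(copy),
        make_union(
            FunctionTransformer(copy),
            make_union(
                FunctionTransformer(copy),
                FunctionTransformer(copy)
            )
        )
    )
)

X = np.array([[1.0, 10.0],
              [2.0, 20.0]])  # 2 samples, 2 features
X_out = union.fit_transform(X)
print(X_out.shape)  # (2, 10): each feature now appears 5 times
print(X_out[0])     # [ 1. 10.  1. 10.  1. 10.  1. 10.  1. 10.]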
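
And a hedged sketch of the last two steps, using make_classification as a synthetic stand-in for the real dataset: SelectFwe drops the columns that fail the F-test, and the extra-trees classifier is then fit on what remains:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFwe, f_classif
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the real data: 20 features, only 5 informative.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# SelectFwe keeps the columns whose f_classif (ANOVA F-test) p-values are
# significant after controlling the family-wise error rate at alpha=0.049.
selector = SelectFwe(score_func=f_classif, alpha=0.049)
X_sel = selector.fit_transform(X, y)
print(X.shape, '->', X_sel.shape)  # e.g. (200, 20) -> (200, k); k depends on the data

# The classifier then sees only the surviving columns.
clf = make_pipeline(
    SelectFwe(score_func=f_classif, alpha=0.049),
    ExtraTreesClassifier(bootstrap=False, criterion="entropy", max_features=1.0,
                         min_samples_leaf=2, min_samples_split=5, n_estimators=100),
)
clf.fit(X, y)
print(clf.score(X, y))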