
What is the difference between Pipeline and make_pipeline in scikit-learn?


I got this from the sklearn webpage:

But I still do not understand when I have to use each one. Can anyone give me an example?


Solution

  • The only difference is that make_pipeline generates names for steps automatically.

    Step names are needed, for example, when you use a pipeline with model selection utilities such as GridSearchCV. With grid search you specify parameters for individual steps of a pipeline, written as <step name>__<parameter name>:

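    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    pipe = Pipeline([('vec', CountVectorizer()), ('clf', LogisticRegression())])
    param_grid = [{'clf__C': [1, 10, 100, 1000]}]
    gs = GridSearchCV(pipe, param_grid)
    gs.fit(X, y)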
    

    Compare it with make_pipeline:

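    from sklearn.pipeline import make_pipeline
    # step names are generated: 'countvectorizer' and 'logisticregression'
    pipe = make_pipeline(CountVectorizer(), LogisticRegression())
    param_grid = [{'logisticregression__C': [1, 10, 100, 1000]}]
    gs = GridSearchCV(pipe, param_grid)
    gs.fit(X, y)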
    

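    So, with Pipeline: step names are explicit, so you do not have to work them out when you need them, and a name stays the same if you swap the estimator used in that step (e.g. 'clf__C' keeps working if you replace LogisticRegression() with another classifier that has a C parameter).

    With make_pipeline: the notation is shorter, and the names are generated by a simple rule (the lowercased class name of each estimator).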

    When to use them is up to you :) I prefer make_pipeline for quick experiments and Pipeline for more stable code; a rule of thumb: IPython Notebook -> make_pipeline; Python module in a larger project -> Pipeline. But it is certainly not a big deal to use make_pipeline in a module or Pipeline in a short script or a notebook.
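
To make the naming difference concrete, here is a minimal, self-contained sketch. The toy sentences and labels are invented purely for illustration, and cv=2 with a two-value grid is just an assumption to keep the run tiny; it is not a recommended setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline

# Toy data, made up for this example only.
X = ["good movie", "great film", "nice plot", "loved it",
     "bad movie", "awful film", "poor plot", "hated it"]
y = [1, 1, 1, 1, 0, 0, 0, 0]

# Pipeline: we pick the step names ourselves.
named = Pipeline([("vec", CountVectorizer()), ("clf", LogisticRegression())])

# make_pipeline: names are derived from the lowercased class names.
auto = make_pipeline(CountVectorizer(), LogisticRegression())

named_steps = [name for name, _ in named.steps]
auto_steps = [name for name, _ in auto.steps]
print(named_steps)
print(auto_steps)

# Grid-search keys are built from the step name, two underscores,
# and the parameter name, so they differ between the two pipelines.
gs_named = GridSearchCV(named, {"clf__C": [1, 10]}, cv=2).fit(X, y)
gs_auto = GridSearchCV(auto, {"logisticregression__C": [1, 10]}, cv=2).fit(X, y)
print(gs_named.best_params_)
print(gs_auto.best_params_)
```

If you are ever unsure which names make_pipeline produced, printing pipe.steps (as above) or list(pipe.get_params()) shows the step names and every key that GridSearchCV will accept.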