What is the difference between pipeline and make_pipeline in scikit-learn?

I got this from the sklearn webpage:

  • Pipeline: Pipeline of transforms with a final estimator

  • make_pipeline: Construct a Pipeline from the given estimators. This is a shorthand for the Pipeline constructor.

But I still do not understand when I have to use each one. Can anyone give me an example?

Harriettharrietta answered 20/11, 2016 at 18:56 Comment(0)

The only difference is that make_pipeline generates names for steps automatically.

Step names are needed e.g. if you want to use a pipeline with model selection utilities (e.g. GridSearchCV). With grid search you need to specify parameters for various steps of a pipeline:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
pipe = Pipeline([('vec', CountVectorizer()), ('clf', LogisticRegression())])
param_grid = [{'clf__C': [1, 10, 100, 1000]}]
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)

compare it with make_pipeline:

from sklearn.pipeline import make_pipeline

pipe = make_pipeline(CountVectorizer(), LogisticRegression())
param_grid = [{'logisticregression__C': [1, 10, 100, 1000]}]
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)

So, with Pipeline:

  • names are explicit, you don't have to figure them out if you need them;
  • names don't change if you swap the estimator/transformer used in a step; e.g. if you replace LogisticRegression() with LinearSVC() you can still use clf__C.

make_pipeline:

  • shorter and arguably more readable notation;
  • names are auto-generated using a straightforward rule (the lowercase class name of the estimator); see the sketch below.
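
For example, a quick sketch of the auto-generated names, using the same estimators as above:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(CountVectorizer(), LogisticRegression())

# step names are the lowercased class names
print([name for name, _ in pipe.steps])
# ['countvectorizer', 'logisticregression']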

When to use them is up to you :) I prefer make_pipeline for quick experiments and Pipeline for more stable code; a rule of thumb: IPython Notebook -> make_pipeline; Python module in a larger project -> Pipeline. But it is certainly not a big deal to use make_pipeline in a module or Pipeline in a short script or a notebook.

Frances answered 20/11, 2016 at 19:28 Comment(3)
Could you tell me where it is documented that the name of LogisticRegression()'s estimator is logisticregression? I had to set a grid search for OneVsRestClassifier(LinearSVC()) but I don't know what name refers to it.Fenner
@Fenner it is documented at scikit-learn.org/stable/modules/generated/… - "their names will be set to the lowercase of their types automatically"Frances
But what about OneVsRestClassifier(LinearSVC()), I have tried all of the following: 'onevsrestclassifier_linearsvc__C', onevsrestclassifier_linearsvc_estimator__C', 'onevsrestclassifier__C', 'linearsvc__C', 'onevsrestclassifier__linearsvc__C', 'onevsrestclassifier-linearsvc__C', 'onevsrestclassifier_linearsvc_estimator__C', 'estimator__C', they all give me Check the list of available parameters with "estimator.get_params().keys()".Fenner

If we look at the source code, make_pipeline() creates a Pipeline object, so the two are equivalent. As mentioned in the answer above, the only difference is that make_pipeline() doesn't accept estimator names; they are set to the lowercase of their types instead. In other words, type(estimator).__name__.lower() is used to create the estimator names (source). So it's really just a simpler way of building a pipeline.
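
You can verify the naming rule directly in plain Python:

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# the auto-generated step name is just the lowercased class name
print(type(LogisticRegression()).__name__.lower())              # logisticregression
print(type(OneVsRestClassifier(LinearSVC())).__name__.lower())  # onevsrestclassifier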

On a related note, you can use get_params() to see the parameter names; this is useful when you need to know which names to use with GridSearchCV(). The parameter names are created by concatenating the estimator names with their kwargs, recursively. For example, max_iter of a LogisticRegression() is stored as 'logisticregression__max_iter', and the C parameter of OneVsRestClassifier(LogisticRegression()) as 'onevsrestclassifier__estimator__C'; the latter because, written with kwargs, it is OneVsRestClassifier(estimator=LogisticRegression()). A short sketch of this nested case follows the grid-search example below.

from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

X, y = make_classification()
pipe = make_pipeline(PCA(), LogisticRegression())

print(pipe.get_params())

# {'memory': None,
#  'steps': [('pca', PCA()), ('logisticregression', LogisticRegression())],
#  'verbose': False,
#  'pca': PCA(),
#  'logisticregression': LogisticRegression(),
#  'pca__copy': True,
#  'pca__iterated_power': 'auto',
#  'pca__n_components': None,
#  'pca__n_oversamples': 10,
#  'pca__power_iteration_normalizer': 'auto',
#  'pca__random_state': None,
#  'pca__svd_solver': 'auto',
#  'pca__tol': 0.0,
#  'pca__whiten': False,
#  'logisticregression__C': 1.0,
#  'logisticregression__class_weight': None,
#  'logisticregression__dual': False,
#  'logisticregression__fit_intercept': True,
#  'logisticregression__intercept_scaling': 1,
#  'logisticregression__l1_ratio': None,
#  'logisticregression__max_iter': 100,
#  'logisticregression__multi_class': 'auto',
#  'logisticregression__n_jobs': None,
#  'logisticregression__penalty': 'l2',
#  'logisticregression__random_state': None,
#  'logisticregression__solver': 'lbfgs',
#  'logisticregression__tol': 0.0001,
#  'logisticregression__verbose': 0,
#  'logisticregression__warm_start': False}

# use the params from above to construct param_grid
param_grid = {'pca__n_components': [2, None], 'logisticregression__C': [0.1, 1]}
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)

best_score = gs.score(X, y)
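
To illustrate the nested case asked about in the comments above, here is a short sketch (reusing X, y and the imports from the example above, plus OneVsRestClassifier and LinearSVC):

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

ovr_pipe = make_pipeline(OneVsRestClassifier(LinearSVC()))

# the wrapped LinearSVC is exposed through the 'estimator' kwarg,
# so its C parameter is addressed as 'onevsrestclassifier__estimator__C'
print('onevsrestclassifier__estimator__C' in ovr_pipe.get_params())  # True

param_grid = {'onevsrestclassifier__estimator__C': [0.1, 1, 10]}
gs = GridSearchCV(ovr_pipe, param_grid)
gs.fit(X, y)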

Circling back to Pipeline vs make_pipeline: Pipeline gives you more flexibility in naming steps, but if you name each estimator using the lowercase of its type, then Pipeline and make_pipeline will both produce the same params and steps attributes.

pca = PCA()
lr = LogisticRegression()
make_pipe = make_pipeline(pca, lr)
pipe = Pipeline([('pca', pca), ('logisticregression', lr)])

make_pipe.get_params() == pipe.get_params()   # True
make_pipe.steps == pipe.steps                 # True
Habanera answered 15/5, 2023 at 2:13 Comment(0)

In scikit-learn, both Pipeline and make_pipeline build a sequence of transformers followed by a final estimator that can be treated as a single unit.

Pipeline: This requires you to explicitly name each step in the sequence.

make_pipeline: This automatically assigns names to each step based on the class names of the estimators.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Define the pipeline with explicitly named steps
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # explicitly named 'scaler'
    ('pca', PCA(n_components=2))   # explicitly named 'pca'
])

In this example we explicitly name each step: 'scaler' for StandardScaler and 'pca' for PCA. This is useful for referencing specific steps later, especially when you need to access or modify them.

from sklearn.pipeline import make_pipeline

# Define the pipeline using make_pipeline
pipeline = make_pipeline(
    StandardScaler(),    # no need to name the step
    PCA(n_components=2)  # no need to name the step
)

In this example, make_pipeline automatically names the steps based on their class names, so you don't need to specify names manually. The steps will be named 'standardscaler' and 'pca' respectively.

Both approaches will yield the same result in terms of the transformations applied to the data.

  • Use Pipeline when you need more control over the names of the steps, especially if you plan to access or modify specific steps later (see the sketch below this list).
  • Use make_pipeline for quicker and cleaner code when you don't need to explicitly name each step.
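
For instance, a minimal sketch of accessing and modifying named steps (step names as in the examples above):

from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipeline = Pipeline([('scaler', StandardScaler()), ('pca', PCA(n_components=2))])

# access a step by its explicit name
print(pipeline.named_steps['scaler'])           # StandardScaler()

# change a step's parameter through its name
pipeline.set_params(pca__n_components=3)

# with make_pipeline the same step is reached via the auto-generated name
auto_pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
print(auto_pipe.named_steps['standardscaler'])  # StandardScaler()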
Fjeld answered 8/7 at 5:32 Comment(0)
