If we look at the source code, make_pipeline() creates a Pipeline object, so they are equivalent. As mentioned by @Mikhail Korobov, the only difference is that make_pipeline() doesn't accept estimator names; each name is set automatically to the lowercase of the estimator's type, i.e. type(estimator).__name__.lower() is used to create the estimator names (source). So it's really just a simpler way of building a pipeline.
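For example, a quick sketch of what that naming rule produces:

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# the step name is simply the lowercased class name
for est in (PCA(), LogisticRegression()):
    print(type(est).__name__.lower())
# pca
# logisticregression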
On a related note, you can use get_params() to get the parameter names. This is useful if you want to know the parameter names for GridSearchCV(). The parameter names are created by recursively joining the estimator names with their keyword arguments using double underscores (e.g. the max_iter parameter of a LogisticRegression() is stored as 'logisticregression__max_iter', and the C parameter in OneVsRestClassifier(LogisticRegression()) as 'onevsrestclassifier__estimator__C'; the latter because, written with explicit keyword arguments, it is OneVsRestClassifier(estimator=LogisticRegression())).
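The nested case can be checked directly; a minimal sketch, wrapping the classifier in a one-step pipeline so the step-name prefix appears:

from sklearn.pipeline import make_pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

nested = make_pipeline(OneVsRestClassifier(LogisticRegression()))
print('onevsrestclassifier__estimator__C' in nested.get_params())  # True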
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification
X, y = make_classification()
pipe = make_pipeline(PCA(), LogisticRegression())
print(pipe.get_params())
# {'memory': None,
# 'steps': [('pca', PCA()), ('logisticregression', LogisticRegression())],
# 'verbose': False,
# 'pca': PCA(),
# 'logisticregression': LogisticRegression(),
# 'pca__copy': True,
# 'pca__iterated_power': 'auto',
# 'pca__n_components': None,
# 'pca__n_oversamples': 10,
# 'pca__power_iteration_normalizer': 'auto',
# 'pca__random_state': None,
# 'pca__svd_solver': 'auto',
# 'pca__tol': 0.0,
# 'pca__whiten': False,
# 'logisticregression__C': 1.0,
# 'logisticregression__class_weight': None,
# 'logisticregression__dual': False,
# 'logisticregression__fit_intercept': True,
# 'logisticregression__intercept_scaling': 1,
# 'logisticregression__l1_ratio': None,
# 'logisticregression__max_iter': 100,
# 'logisticregression__multi_class': 'auto',
# 'logisticregression__n_jobs': None,
# 'logisticregression__penalty': 'l2',
# 'logisticregression__random_state': None,
# 'logisticregression__solver': 'lbfgs',
# 'logisticregression__tol': 0.0001,
# 'logisticregression__verbose': 0,
# 'logisticregression__warm_start': False}
# use the params from above to construct param_grid
param_grid = {'pca__n_components': [2, None], 'logisticregression__C': [0.1, 1]}
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)
best_score = gs.score(X, y)
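The same double-underscore names also work with set_params() if you want to change parameters directly rather than through a grid search; a small sketch:

from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(PCA(), LogisticRegression())
# same names that param_grid uses above
pipe.set_params(pca__n_components=2, logisticregression__C=0.5)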
Circling back to Pipeline vs make_pipeline: Pipeline gives you more flexibility in naming the steps (and hence the parameter names), but if you name each estimator using the lowercase of its type, then Pipeline and make_pipeline will both have the same params and steps attributes.
pca = PCA()
lr = LogisticRegression()
make_pipe = make_pipeline(pca, lr)
pipe = Pipeline([('pca', pca), ('logisticregression', lr)])
make_pipe.get_params() == pipe.get_params() # True
make_pipe.steps == pipe.steps # True
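Conversely, if you pick custom step names (here 'reduce' and 'clf', arbitrary choices for illustration), the parameter names follow them:

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

custom = Pipeline([('reduce', PCA()), ('clf', LogisticRegression())])
print('clf__C' in custom.get_params())                  # True
print('logisticregression__C' in custom.get_params())   # False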