Using statsmodel estimations with scikit-learn cross validation, is it possible?
Asked Answered
C

4

39

I am looking for a way I can use the fit object (result) obtained from python statsmodel to feed into cross_val_score of scikit-learn cross_validation method?

The attached link suggests that it may be possible but I have not succeeded.

I am getting the following error

estimator should a be an estimator implementing 'fit' method
statsmodels.discrete.discrete_model.BinaryResultsWrapper object at 
0x7fa6e801c590 was passed

Refer this link

Censor answered 8/12, 2016 at 17:51 Comment(1)
I think this is correct: #75536232Sore
T
51

Indeed, you cannot use cross_val_score directly on statsmodels objects, because of different interface: in statsmodels

  • training data is passed directly into the constructor
  • a separate object contains the result of model estimation

However, you can write a simple wrapper to make statsmodels objects look like sklearn estimators:

import statsmodels.api as sm
from sklearn.base import BaseEstimator, RegressorMixin

class SMWrapper(BaseEstimator, RegressorMixin):
    """ A universal sklearn-style wrapper for statsmodels regressors """
    def __init__(self, model_class, fit_intercept=True):
        self.model_class = model_class
        self.fit_intercept = fit_intercept
    def fit(self, X, y):
        if self.fit_intercept:
            X = sm.add_constant(X)
        self.model_ = self.model_class(y, X)
        self.results_ = self.model_.fit()
        return self
    def predict(self, X):
        if self.fit_intercept:
            X = sm.add_constant(X)
        return self.results_.predict(X)

This class contains correct fit and predict methods, and can be used with sklearn, e.g. cross-validated or included into a pipeline. Like here:

from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

X, y = make_regression(random_state=1, n_samples=300, noise=100)

print(cross_val_score(SMWrapper(sm.OLS), X, y, scoring='r2'))
print(cross_val_score(LinearRegression(), X, y, scoring='r2'))

You can see that the output of two models is identical, because they are both OLS models, cross-validated in the same way.

[0.28592315 0.37367557 0.47972639]
[0.28592315 0.37367557 0.47972639]
Turne answered 23/2, 2018 at 14:6 Comment(4)
I got nan for my cross_val_score() with the wrapper. Any ideas what could be the cause?Daven
you are reinitializing the models inside cross_val_score everytime, I think it should be outside the cross_val_scoreSamaria
the moment of initialization does not affect the resultTurne
I also got a list of NaN because I passed an instantiated version of sm.OLS(y, X) to SMWrapper(). It only works if you pass it SMWrapper(sm.OLS) exactly as shown above.Mikemikel
B
13

Following the suggestion of David (which gave me an error, complaining about missing function get_parameters) and the scikit learn documentation, I created the following wrapper for a linear regression. It has the same interface of sklearn.linear_model.LinearRegression but in addition has also the function summary(), which gives the info about p-values, R2 and other statistics, as in statsmodels.OLS.

import statsmodels.api as sm
from sklearn.base import BaseEstimator, RegressorMixin
import pandas as pd
import numpy as np

from sklearn.utils.multiclass import check_classification_targets
from sklearn.utils.validation import check_X_y, check_is_fitted, check_array
from sklearn.utils.multiclass import unique_labels
from sklearn.utils.estimator_checks import check_estimator



class MyLinearRegression(BaseEstimator, RegressorMixin):
    def __init__(self, fit_intercept=True):

        self.fit_intercept = fit_intercept


    """
    Parameters
    ------------
    column_names: list
            It is an optional value, such that this class knows 
            what is the name of the feature to associate to 
            each column of X. This is useful if you use the method
            summary(), so that it can show the feature name for each
            coefficient
    """ 
    def fit(self, X, y, column_names=() ):

        if self.fit_intercept:
            X = sm.add_constant(X)

        # Check that X and y have correct shape
        X, y = check_X_y(X, y)


        self.X_ = X
        self.y_ = y

        if len(column_names) != 0:
            cols = column_names.copy()
            cols = list(cols)
            X = pd.DataFrame(X)
            cols = column_names.copy()
            cols.insert(0,'intercept')
            print('X ', X)
            X.columns = cols

        self.model_ = sm.OLS(y, X)
        self.results_ = self.model_.fit()
        return self



    def predict(self, X):
        # Check is fit had been called
        check_is_fitted(self, 'model_')

        # Input validation
        X = check_array(X)

        if self.fit_intercept:
            X = sm.add_constant(X)
        return self.results_.predict(X)


    def get_params(self, deep = False):
        return {'fit_intercept':self.fit_intercept}


    def summary(self):
        print(self.results_.summary() )

Example of use:

cols = ['feature1','feature2']
X_train = df_train[cols].values
X_test = df_test[cols].values
y_train = df_train['label']
y_test = df_test['label']
model = MyLinearRegression()
model.fit(X_train, y_train)
model.summary()
model.predict(X_test)

If you want to show the names of the columns, you can call

model.fit(X_train, y_train, column_names=cols)

To use it in cross_validation:

from sklearn.model_selection import cross_val_score
scores = cross_val_score(MyLinearRegression(), X_train, y_train, cv=10, scoring='neg_mean_squared_error')
scores
Barbell answered 14/2, 2020 at 23:37 Comment(2)
In the last comment "To use it in cross_validation", why are you using X_train and y_train in cross_val_score instead of just X and y ?Dumas
Because I consider the following protocol: (i) Divide the samples in training and test set (ii) Select the best model, i.e., the one giving the highest cross-validation-score, JUST USING the training set, to avoid any data leaks (iii) Check the performance of such a model on the "unseen" data contained in the test set. If you used the entire set for cross-validation, you would select the model based on the same data on which you then judge the model. This would technically be a data-leak. Indeed, it would not give you an indication of how your model behaves with completely unseen data.Barbell
C
6

For reference purpose, if you use the statsmodels formula API and/or use the fit_regularized method, you can modify @David Dale's wrapper class in this way.

import pandas as pd
from sklearn.base import BaseEstimator, RegressorMixin
from statsmodels.formula.api import glm as glm_sm

# This is an example wrapper for statsmodels GLM
class SMWrapper(BaseEstimator, RegressorMixin):
    def __init__(self, family, formula, alpha, L1_wt):
        self.family = family
        self.formula = formula
        self.alpha = alpha
        self.L1_wt = L1_wt
        self.model = None
        self.result = None
    def fit(self, X, y):
        data = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)
        data.columns = X.columns.tolist() + ['y']
        self.model = glm_sm(self.formula, data, family=self.family)
        self.result = self.model.fit_regularized(alpha=self.alpha, L1_wt=self.L1_wt, refit=True)
        return self.result
    def predict(self, X):
        return self.result.predict(X)
Carping answered 29/11, 2019 at 7:30 Comment(0)
B
-1

Though I think this is not technically scikit-learn, there is the package pmdarima (link to pmdarima package on PyPi) that wraps statsmodel and provides a scikit-learn like interface.

Binghi answered 10/1, 2020 at 15:53 Comment(2)
Hello, Andre. Please consider adding more information in your answer instead of linking to a external source.Nolan
please summarize the content of the link, in case of link rotJochbed

© 2022 - 2024 — McMap. All rights reserved.