Poisson Regression in statsmodels and R
Asked Answered
A

2

22

Given the some randomly generated data with

  • 2 columns,
  • 50 rows and
  • integer range between 0-100

With R, the poisson glm and diagnostics plot can be achieved as such:

> col=2
> row=50
> range=0:100
> df <- data.frame(replicate(col,sample(range,row,rep=TRUE)))
> model <- glm(X2 ~ X1, data = df, family = poisson)
> glm.diag.plots(model)

In Python, this would give me the line predictor vs residual plot:

import numpy as np
import pandas as pd
import statsmodels.formula.api
from statsmodels.genmod.families import Poisson
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randint(100, size=(50,2)))
df.rename(columns={0:'X1', 1:'X2'}, inplace=True)
glm = statsmodels.formula.api.gee
model = glm("X2 ~ X1", groups=None, data=df, family=Poisson())
results = model.fit()

And to plot the diagnostics in Python:

model_fitted_y = results.fittedvalues  # fitted values (need a constant term for intercept)
model_residuals = results.resid # model residuals
model_abs_resid = np.abs(model_residuals)  # absolute residuals


plot_lm_1 = plt.figure(1)
plot_lm_1.set_figheight(8)
plot_lm_1.set_figwidth(12)
plot_lm_1.axes[0] = sns.residplot(model_fitted_y, 'X2', data=df, lowess=True, scatter_kws={'alpha': 0.5}, line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
plot_lm_1.axes[0].set_xlabel('Line Predictor')
plot_lm_1.axes[0].set_ylabel('Residuals')
plt.show()

But when I try to get the cook statistics,

# cook's distance, from statsmodels internals
model_cooks = results.get_influence().cooks_distance[0]

it threw an error saying:

AttributeError                            Traceback (most recent call last)
<ipython-input-66-0f2bedfa1741> in <module>()
      4 model_residuals = results.resid
      5 # normalized residuals
----> 6 model_norm_residuals = results.get_influence().resid_studentized_internal
      7 # absolute squared normalized residuals
      8 model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))

/opt/conda/lib/python3.6/site-packages/statsmodels/base/wrapper.py in __getattribute__(self, attr)
     33             pass
     34 
---> 35         obj = getattr(results, attr)
     36         data = results.model.data
     37         how = self._wrap_attrs.get(attr)

AttributeError: 'GEEResults' object has no attribute 'get_influence'

Is there a way to plot out all 4 diagnostic plots in Python like in R?

How do I retrieve the cook statistics of the fitted model results in Python using statsmodels?

Amata answered 7/12, 2017 at 1:35 Comment(2)
outlier and influence measures are only available for OLS and maybe WLS. (It might not be difficult to use some GLM residuals, but it would need unit tests against R or Stata. GEE might be more difficult.)Phenomena
For some purposes, R is truly the king. While Python has minimal and shorter code than R, a lot of work is done in just a handful of commands in the latter language. I miss R's commands ;)Hibernicism
C
16

The generalized estimating equations API should give you a different result than R's GLM model estimation. To get similar estimates in statsmodels, you need to use something like:

import pandas as pd
import statsmodels.api as sm

# Read data generated in R using pandas or something similar
df = pd.read_csv(...) # file name goes here

# Add a column of ones for the intercept to create input X
X = np.column_stack( (np.ones((df.shape[0], 1)), df.X1) )

# Relabel dependent variable as y (standard notation)
y = df.X2

# Fit GLM in statsmodels using Poisson link function
sm.GLM(y, X, family = Poisson()).fit().summary()

EDIT -- Here is the rest of the answer on how to get Cook's distance in Poisson regression. This is a script I wrote based on some data generated in R. I compared my values against those in R calculated using the cooks.distance function and the values matched.

from __future__ import division, print_function

import numpy as np
import pandas as pd
import statsmodels.api as sm

PATH = '/Users/robertmilletich/test_reg.csv'


def _weight_matrix(fitted_model):
    """Calculates weight matrix in Poisson regression

    Parameters
    ----------
    fitted_model : statsmodel object
        Fitted Poisson model

    Returns
    -------
    W : 2d array-like
        Diagonal weight matrix in Poisson regression
    """
    return np.diag(fitted_model.fittedvalues)


def _hessian(X, W):
    """Hessian matrix calculated as -X'*W*X

    Parameters
    ----------
    X : 2d array-like
        Matrix of covariates

    W : 2d array-like
        Weight matrix

    Returns
    -------
    hessian : 2d array-like
        Hessian matrix
    """
    return -np.dot(X.T, np.dot(W, X))


def _hat_matrix(X, W):
    """Calculate hat matrix = W^(1/2) * X * (X'*W*X)^(-1) * X'*W^(1/2)

    Parameters
    ----------
    X : 2d array-like
        Matrix of covariates

    W : 2d array-like
        Diagonal weight matrix

    Returns
    -------
    hat : 2d array-like
        Hat matrix
    """
    # W^(1/2)
    Wsqrt = W**(0.5)

    # (X'*W*X)^(-1)
    XtWX     = -_hessian(X = X, W = W)
    XtWX_inv = np.linalg.inv(XtWX)

    # W^(1/2)*X
    WsqrtX = np.dot(Wsqrt, X)

    # X'*W^(1/2)
    XtWsqrt = np.dot(X.T, Wsqrt)

    return np.dot(WsqrtX, np.dot(XtWX_inv, XtWsqrt))


def main():

    # Load data and separate into X and y
    df = pd.read_csv(PATH)
    X  = np.column_stack( (np.ones((df.shape[0], 1)), df.X1 ) )
    y  = df.X2

    # Fit model
    model = sm.GLM(y, X, family=sm.families.Poisson()).fit()

    # Weight matrix
    W = _weight_matrix(model)

    # Hat matrix
    H   = _hat_matrix(X, W)
    hii = np.diag(H) # Diagonal values of hat matrix

    # Pearson residuals
    r = model.resid_pearson

    # Cook's distance (formula used by R = (res/(1 - hat))^2 * hat/(dispersion * p))
    # Note: dispersion is 1 since we aren't modeling overdispersion
    cooks_d = (r/(1 - hii))**2 * hii/(1*2)

if __name__ == "__main__":
    main()
Cramfull answered 2/1, 2018 at 21:31 Comment(0)
P
2

As an update here

statsmodels has now, since version 0.10, get_influence method also for GLMResults.

https://www.statsmodels.org/dev/examples/notebooks/generated/influence_glm_logit.html

for example:

Print influence and outlier measures for 10 observations with largest cook distance:

infl = res.get_influence(observed=False)
summ_df = infl.summary_frame()
summ_df.sort_values("cooks_d", ascending=False)[:10]

There are no combination plots, but influence plot infl.plot_influence() and index plot infl.plot_index(...) for any of the measures are available.

Generic influence measures for maximum likelihood models is or will become available for discrete and other models.

MLE influence measures are based on hessian, i.e. observed information matrix, while for GLM both expected information matrix and hessian versions are available. In GLM, the distinction is only relevant when non-canonical links are used.

Phenomena answered 18/10, 2021 at 0:31 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.