How to calculate Cook's distance and DFFITS using Python statsmodels

I want to calculate Cook's distance (cooks_d) and DFFITS in Python using statsmodels.

Here is my code in Python:

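import statsmodels.api as sm

# your_str_cleaned is assumed to be a pandas DataFrame; param names the predictor column(s)
X = your_str_cleaned[param]
y = your_str_cleaned['Visitor']
X = sm.add_constant(X)  # add an intercept column
model = sm.OLS(y, X)
results = model.fit()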

I tried this to get Cook's distance and DFFITS:

import statsmodels.stats.outliers_influence as st_inf
st_inf.OLSInfluence.summary_frame(results)

But I am getting this error:

'OLSResults' object has no attribute 'results'.

Can someone help me find where I am going wrong?

Dressel answered 17/7, 2018 at 21:6 Comment(1)
As shown in the answer, using get_influence is easier. In this case the call argument is in the wrong place: results should be passed to OLSInfluence, not to summary_frame, i.e. st_inf.OLSInfluence(results).summary_frame() should work. - Kitchenware

I ran into the same problem, so I had to find a workaround. I don't have much experience, and this doesn't fix the root issue with OLSInfluence, but it does give you the summary_frame.

I will use a pandas DataFrame as the source of the data. Even if your data is in another kind of object (such as arrays), you can convert it to a DataFrame with relative ease. To show how it works, I will import the Boston housing prices data set from sklearn.datasets:

import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.datasets import load_boston

# load the data set
boston = load_boston()

# build the DataFrame bos from the feature matrix
bos = pd.DataFrame(boston.data)

# name the columns of bos
bos.columns = boston.feature_names

# add the target as column 'PRICE'
bos['PRICE'] = boston.target

Now let us consider the relation between the column 'RM' and the column 'PRICE', with 'RM' as the independent variable. For simplicity, let us use simple OLS. Here comes the actual answer:

from statsmodels.formula.api import ols

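m = ols('PRICE ~ RM', bos).fit()  # fit PRICE on RM with an intercept
infl = m.get_influence()          # influence object for the fitted model
sm_fr = infl.summary_frame()      # DataFrame of influence measures, one row per observation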

sm_fr has the columns cooks_d and dffits that you are looking for.
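As a rough sketch of what you might do with those two columns, the example below flags observations using two common rules of thumb: Cook's distance above 4/n and |DFFITS| above 2*sqrt(p/n). These cutoffs are conventions I am assuming here, not something statsmodels enforces, and the data is synthetic (made-up RM and PRICE values) so that the snippet runs without the Boston data set.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in for the housing data (assumption): RM predicts PRICE with noise.
rng = np.random.default_rng(1)
data = pd.DataFrame({"RM": rng.normal(6.0, 0.7, size=100)})
data["PRICE"] = -30.0 + 9.0 * data["RM"] + rng.normal(0.0, 5.0, size=100)

m = ols("PRICE ~ RM", data).fit()
sm_fr = m.get_influence().summary_frame()

n = int(m.nobs)
p = len(m.params)  # number of estimated coefficients, intercept included

# Rule-of-thumb cutoffs (assumed conventions, adjust to taste).
cooks_cutoff = 4.0 / n
dffits_cutoff = 2.0 * np.sqrt(p / n)

# Keep rows that exceed either cutoff.
mask = (sm_fr["cooks_d"] > cooks_cutoff) | (sm_fr["dffits"].abs() > dffits_cutoff)
flagged = sm_fr[mask]
print(flagged[["cooks_d", "dffits"]])
```

The index of flagged matches the index of the original DataFrame, so the flagged rows can be looked up in the data directly.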

Memorandum answered 13/9, 2018 at 21:46 Comment(2)
Thanks. This solved my problem. You can also get DFFITS and Cook's distance directly from the influence object: (c, p) = infl.cooks_distance returns the distances and their p-values, and (d, t) = infl.dffits returns the DFFITS values and a suggested threshold. - Dressel
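A small sketch of the direct access mentioned in this comment, again on synthetic data (an assumption). Note that in statsmodels these are attributes of the influence object rather than of the fitted results, and that the second element returned by dffits is a suggested cutoff rather than a p-value.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in data (assumption), same shape as the RM / PRICE example.
rng = np.random.default_rng(2)
data = pd.DataFrame({"RM": rng.normal(6.0, 0.7, size=80)})
data["PRICE"] = -30.0 + 9.0 * data["RM"] + rng.normal(0.0, 5.0, size=80)

m = ols("PRICE ~ RM", data).fit()
infl = m.get_influence()

# cooks_distance gives (distances, p-values).
cooks_d, cooks_p = infl.cooks_distance

# dffits gives (DFFITS values, suggested threshold).
dffits, dffits_threshold = infl.dffits

# Both arrays should agree with the corresponding summary_frame columns.
sm_fr = infl.summary_frame()
print(np.allclose(cooks_d, sm_fr["cooks_d"]))
print(np.allclose(dffits, sm_fr["dffits"]))
print(dffits_threshold)
```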
Are these the same p-values that follow the F(p, n-p) distribution? Are they in any way affected by a poorly fitting model? Would the p-values be highly significant in that case? online.stat.psu.edu/stat501/lesson/11/11.5 - Eustashe
