How to get the P Value in a Variable from OLSResults in Python?

The OLSResults of

import pandas as pd
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant

df2 = pd.read_csv("MultipleRegression.csv")
X = df2[['Distance', 'CarrierNum', 'Day', 'DayOfBooking']]
Y = df2['Price']
X = add_constant(X)
fit = sm.OLS(Y, X).fit()
print(fit.summary())

shows the P values of each attribute to only 3 decimal places.

I need to extract the p value for each attribute like Distance, CarrierNum etc. and print it in scientific notation.

I can extract the coefficients using fit.params[0] or fit.params[1] etc.

I need to do the same for all of their p-values.

Also what does all P values being 0 mean?

Concessionaire answered 10/12, 2016 at 11:38 Comment(1)
Try dir(fit) and look for likely candidates. – Cultural
C
27

Use fit.pvalues[i], where i is the position of the term in the design matrix: fit.pvalues[0] for the intercept, fit.pvalues[1] for Distance, and so on. In recent pandas versions, positional access on a labelled Series is better written as fit.pvalues.iloc[i].

You can also look for all the attributes of an object using dir(<object>).

Concessionaire answered 13/12, 2016 at 18:42 Comment(3)
Since pvalues is a pandas Series, you can access a specific p-value by label, say for 'Distance', with fit.pvalues.loc['Distance']. – Gallardo
I am using statsmodels version 0.13.5. There is no pvalues attribute on the returned model... – Trimly
I just noticed that pvalues belongs to the results object returned by sm.OLS(...).fit(), not to the object returned by .summary(). Thanks! – Trimly
A
6

Instead of using fit.summary() you could use fit.pvalues[attributeIndex] in a for loop to print the p-values of all your features/attributes as follows:

df2 = pd.read_csv("MultipleRegression.csv")
X = df2[['Distance', 'CarrierNum', 'Day', 'DayOfBooking']]
Y = df2['Price']
X = add_constant(X)
fit = sm.OLS(Y, X).fit()
# One p-value per fitted term, including the constant
numberOfAttributes = len(fit.pvalues)
for attributeIndex in range(numberOfAttributes):
    print(fit.pvalues.iloc[attributeIndex])

==========================================================================

Also what does all P values being 0 mean?

It might be a good outcome. The p-value for each term tests the null hypothesis that its coefficient (b1, b2, ..., bn) is equal to zero, i.e. that the term has no effect in the fitted equation y = b0 + b1x1 + b2x2 + ... A low p-value (< 0.05) indicates that you can reject the null hypothesis. In other words, a predictor that has a low p-value is likely to be a meaningful addition to your model because changes in the predictor's value are related to changes in the response variable (y).

On the other hand, a larger (insignificant) p-value means the data give no strong evidence that changes in the predictor are associated with changes in the response.
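It is also worth noting that a p-value shown as 0 in the summary is most likely a rounding effect: as the question says, the table displays only three decimals, so anything below 0.0005 prints as 0.000. A small sketch with a made-up p-value (standing in for one entry of fit.pvalues):

```python
# A made-up p-value, standing in for an entry of fit.pvalues
p = 3.2e-17

fixed = f"{p:.3f}"       # three decimals, as in the summary table
scientific = f"{p:.3e}"  # scientific notation keeps the magnitude

print(fixed)       # 0.000
print(scientific)  # 3.200e-17
```

So the underlying values are usually tiny rather than exactly zero, and printing them with an `e` format specifier shows how small they are.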

Angelita answered 2/1, 2021 at 20:29 Comment(0)
R
1

I have used this solution

df2 = pd.read_csv("MultipleRegression.csv")
X = df2[['Distance', 'CarrierNum', 'Day', 'DayOfBooking']]
Y = df2['Price']
X = add_constant(X)
model = sm.OLS(Y, X).fit()

# The following snippet builds a sorted DataFrame of feature names and their p-values,
# so the most relevant features appear at the top (p-values sorted in ascending order).

d = {}
for i in X.columns.tolist():
    d[i] = model.pvalues[i]

df_pvalue= pd.DataFrame(d.items(), columns=['Var_name', 'p-Value']).sort_values(by = 'p-Value').reset_index(drop=True)
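A possible shortcut, assuming the p-values are already a pandas Series (as they are when the model is fitted on a DataFrame): the same sorted table can be built without the intermediate dictionary. The Series below is a made-up stand-in for model.pvalues, not real output.

```python
import pandas as pd

# Stand-in for model.pvalues (made-up values; a fitted model would supply these)
pvalues = pd.Series({
    "const": 0.21,
    "Distance": 3.2e-17,
    "CarrierNum": 0.048,
    "Day": 0.63,
    "DayOfBooking": 1.5e-4,
})

# Sort ascending, then turn the index (term names) into a regular column
df_pvalue = (
    pvalues.sort_values()
    .rename_axis("Var_name")
    .reset_index(name="p-Value")
)
print(df_pvalue)
```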
Reimburse answered 18/4, 2022 at 17:0 Comment(0)
