ValueError: endog must be in the unit interval

While using statsmodels, I am getting this weird error: ValueError: endog must be in the unit interval. Can someone give me more information on this error? Google is not helping.

Code that produced the error:

"""
Multiple regression with dummy variables. 
"""

import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np

df = pd.read_csv('cost_data.csv')
df.columns = ['Cost', 'R(t)', 'Day of Week']
dummy_ranks = pd.get_dummies(df['Day of Week'], prefix='days')
cols_to_keep = ['Cost', 'R(t)']
data = df[cols_to_keep].join(dummy_ranks.loc[:, 'days_2':])
data['intercept'] = 1.0

print(data)

train_cols = data.columns[1:]
logit = sm.Logit(data['Cost'], data[train_cols])

result = logit.fit()

print(result.summary())

And the traceback:

Traceback (most recent call last):
  File "multiple_regression_dummy.py", line 20, in <module>
    logit = sm.Logit(data['Cost'], data[train_cols])
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/statsmodels/discrete/discrete_model.py", line 404, in __init__
    raise ValueError("endog must be in the unit interval.")
ValueError: endog must be in the unit interval.
Kirakiran answered 9/7, 2015 at 15:47 Comment(4)
Perhaps check the condition that generates this error: if (self.__class__.__name__ != 'MNLogit' and not np.all((self.endog >= 0) & (self.endog <= 1))): raise ValueError("endog must be in the unit interval.") - Rattigan
What's your Cost data? Logit requires that the dependent variable (endog) is in the unit interval. If you want logistic regression with values in another interval, then you need to transform your values so that they are in the unit interval. However, Logit does not require that the endog values are the integers 0 and 1, so it can also be used for proportions. - Hawkinson
Ah, Cost is not in the unit interval. Any idea why Logit requires this? - Kirakiran
The underlying distribution of Logit is a Bernoulli distribution, which takes on the values 0 and 1. This can be extended to any values between 0 and 1, but the functions are not defined outside of the unit interval. If you have a positive dependent variable and an exponential mean function, then the Poisson distribution can be used, even if the data is continuous. For unbounded continuous data the usual model is OLS. - Hawkinson
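
The condition quoted in the first comment can be reproduced on the response column before the model is built, which shows whether the failure comes from out-of-range values or from missing ones. The following is only a sketch: `unit_interval_report` is a hypothetical helper, not part of statsmodels, and the small `cost` series is made-up data standing in for the real `Cost` column. Note that any comparison with NaN is false, so a single NaN is enough to fail the check.

```python
import numpy as np
import pandas as pd


def unit_interval_report(endog):
    """Hypothetical helper: summarize why a response would fail the quoted check."""
    values = pd.Series(endog, dtype="float64")
    return {
        "n_missing": int(values.isna().sum()),
        "n_below_0": int((values < 0).sum()),
        "n_above_1": int((values > 1).sum()),
        # Same condition as in the quoted statsmodels check.
        "passes": bool(np.all((values >= 0) & (values <= 1))),
    }


# Made-up values: one above 1 and one missing.
cost = pd.Series([12.5, 0.4, np.nan, 1.0])
report = unit_interval_report(cost)
print(report)
```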

I got this error when my target column had values larger than 1. Make sure your target column is between 0 and 1 (as is required for a Logistic Regression) and try again. For example, if you have target column with values 1-5, make 4 and 5 the positive class and 1,2,3 the negative class. Hope this helps.
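
As a minimal sketch of the recoding described above, assuming a made-up `ratings` column with values 1-5 (the name and the data are illustrative, not from the question):

```python
import pandas as pd

# Hypothetical 1-5 ratings standing in for the real target column.
ratings = pd.Series([1, 2, 3, 4, 5, 4], name="rating")

# 4 and 5 become the positive class (1); 1, 2 and 3 become the negative class (0).
target = (ratings >= 4).astype(int)
print(target.tolist())  # [0, 0, 0, 1, 1, 1]
```

The resulting 0/1 column satisfies the unit-interval requirement. Whether a threshold of 4 makes sense depends on the problem.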

Mizzenmast answered 10/9, 2015 at 21:10 Comment(1)
Legend! I had a NaN in my target column. - Subequatorial

It seems like you followed the same logistic regression tutorial that I did: http://blog.yhat.com/posts/logistic-regression-and-python.html

I ended up getting the same ValueError when I fit my logistic regression, and the trick I needed to get it running was to drop all rows of my data with missing values (N/A or np.nan).

This can be done with the pandas function pandas.notnull() as follows:

data = data[pd.notnull(data['Cost'])]

data = data[pd.notnull(data['R(t)'])]

...

and so on until all your variables have the same number of values to work with.
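
The per-column filtering above can also be written as one call to `DataFrame.dropna`. This is a sketch on made-up data; only the column names `Cost` and `R(t)` are taken from the question.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in each column.
data = pd.DataFrame({
    "Cost": [0.0, 1.0, np.nan, 1.0],
    "R(t)": [0.3, np.nan, 0.8, 0.5],
})

# Drop every row that has a missing value in any of the listed columns.
clean = data.dropna(subset=["Cost", "R(t)"])
print(clean)
```

Calling `data.dropna()` without `subset` would drop rows with a missing value in any column at all.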

Hope this helps someone else!

Subacute answered 13/8, 2016 at 21:51 Comment(0)

I had the same problem and solved it by switching from a classification model to a regression model (I had been using Logit, a classification model, on a regression problem).

You can still use statsmodels, but with OLS, for example, instead of Logit. Logit (logistic regression) is for classification problems, whereas this appears to be a regression problem, so using OLS could solve it.

Quesada answered 25/1, 2022 at 9:39 Comment(0)
