Logistic Regression Using statsmodels.api with R syntax in Python

I am trying to run a simple logistic regression. My data has four columns named x1, x2, x3, and x4. x4 contains only zeros and ones, so I am using it as the dependent variable, with x1, x2, and x3 as the independent variables. Is my syntax off, or how can I properly fit a logistic regression on my data while keeping the R-style formula syntax that statsmodels.formula.api provides?

The following is my code:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({'x1': [10, 11, 0, 14],
                   'x2': [12, 0, 1, 24],
                   'x3': [0, 65, 3, 2],
                   'x4': [0, 0, 1, 0]})

model = smf.logit(formula='x4 ~ x1 + x2 + x3', data=df).fit()
print(model)

The following is my error:

statsmodels.tools.sm_exceptions.PerfectSeparationError: Perfect separation detected, results not available

I understand what the error means, but I do not understand how to avoid it. What values are needed for the logistic regression to fit successfully? Is my syntax correct, and is there a better way to do this (with the R syntax)?

Parkin asked 6/8, 2019 at 5:49

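I may be misunderstanding the question, but the syntax looks fine, though you probably want print(model.summary()) rather than print(model), which only prints the repr of the results object. The real issue is that your sample is too small: with four rows and four parameters to estimate (an intercept plus three slopes), the outcome can be predicted perfectly, so the maximum-likelihood fit has no finite solution.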
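As a rough illustration of why the four rows in the question fail, the sketch below checks whether any single predictor splits the two outcome classes with one threshold. It uses only the standard library and the data copied from the question; the helper name separating_predictors is hypothetical, not part of statsmodels. It is only a sufficient check: a combination of predictors can separate the classes even when no single column does.

```python
# Data copied from the question.
data = {
    "x1": [10, 11, 0, 14],
    "x2": [12, 0, 1, 24],
    "x3": [0, 65, 3, 2],
    "x4": [0, 0, 1, 0],
}


def separating_predictors(data, target):
    """Return predictors whose values for the two outcome classes do not overlap.

    Hypothetical helper for illustration; assumes both classes are present.
    """
    found = []
    for name, values in data.items():
        if name == target:
            continue
        zeros = [v for v, y in zip(values, data[target]) if y == 0]
        ones = [v for v, y in zip(values, data[target]) if y == 1]
        # One threshold splits the classes if one group lies entirely below the other.
        if max(zeros) < min(ones) or max(ones) < min(zeros):
            found.append(name)
    return found


print(separating_predictors(data, "x4"))  # → ['x1']
```

Here x1 is 0 for the only row with x4 = 1 and at least 10 for every other row, so x1 alone predicts the outcome perfectly.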

For example, this works with 100 rows:

import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

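np.random.seed(2)  # fixed seed so the example is reproducible
n = 100
# Three standard-normal predictors and a random 0/1 outcome
df = pd.DataFrame({'x1': np.random.randn(n),
                   'x2': np.random.randn(n),
                   'x3': np.random.randn(n),
                   'x4': np.random.randint(0, 2, n)})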

model = smf.logit(formula='x4 ~ x1 + x2 + x3', data=df).fit()
print(model.summary())

Changing to n=10 yields the following message under the summary table:

Possibly complete quasi-separation: A fraction 0.40 of observations can be perfectly predicted. This might indicate that there is complete quasi-separation. In this case some parameters will not be identified.

Changing to n=5 yields:

PerfectSeparationError: Perfect separation detected, results not available
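To see why perfect separation leaves the fit without a finite answer, here is a minimal sketch using only the standard library and the x1 and x4 values from the question. It evaluates the logistic log-likelihood for the linear predictor scale * (5 - x1); the threshold 5 is an assumption picked by hand (any value between 0 and 10 separates these rows), and log_likelihood is a hypothetical helper, not a statsmodels function.

```python
import math

# x1 and the binary outcome x4, copied from the question.
x1 = [10, 11, 0, 14]
x4 = [0, 0, 1, 0]


def log_likelihood(scale):
    """Logit log-likelihood for the linear predictor scale * (5 - x1)."""
    total = 0.0
    for x, y in zip(x1, x4):
        z = scale * (5 - x)
        # log(sigmoid(z)) when y == 1, log(1 - sigmoid(z)) = log(sigmoid(-z)) when y == 0
        signed = z if y == 1 else -z
        total += -math.log1p(math.exp(-signed))
    return total


# The log-likelihood keeps rising toward 0 as the coefficients are scaled up.
for scale in (0.5, 1, 2, 4):
    print(scale, log_likelihood(scale))
```

Because every row sits on the correct side of the threshold, making the coefficients larger always improves the likelihood, so the optimizer has no finite maximum to converge to. Adding more observations, so that the classes overlap, removes the problem.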

Polygon answered 6/8, 2019 at 21:15

Comment (Parkin): Your logic makes sense. The issue isn't syntax but the value of n: I was considering too few observations.
