Splitting data into test and train, making a logistic regression model in pandas

I'm trying to run this code (credit goes to Greg):

import pandas as pd
from sklearn.model_selection import train_test_split
import statsmodels.api as sm

quality = pd.read_csv("https://courses.edx.org/c4x/MITx/15.071x/asset/quality.csv")
train, test = train_test_split(quality, train_size=0.75, random_state=1)

qualityTrain = pd.DataFrame(train, columns=quality.columns)
qualityTest = pd.DataFrame(test, columns=quality.columns)

qualityTrain['PoorCare'] = qualityTrain['PoorCare'].astype(int)

cols = ['OfficeVisits', 'Narcotics']
x = qualityTrain[cols]
x = sm.add_constant(x)
y = qualityTrain['PoorCare']

model = sm.Logit(y, x).fit()
model.summary()

But I'm getting:

AttributeError: 'int' object has no attribute 'exp'

on the second-to-last line (the sm.Logit(y, x).fit() call). This seems to be introduced by splitting the data with train_test_split, because the model fits just fine on the whole, unmodified dataset.

How can I fix this?

Asked by Maurine, 23/3, 2015 at 22:46

Just convert the x variable to floats:

model = sm.Logit(y, x.astype(float)).fit()

I get the following result:

<class 'statsmodels.iolib.summary.Summary'>
"""
                           Logit Regression Results                           
==============================================================================
Dep. Variable:               PoorCare   No. Observations:                   98
Model:                          Logit   Df Residuals:                       95
Method:                           MLE   Df Model:                            2
Date:                Mon, 23 Mar 2015   Pseudo R-squ.:                  0.2390
Time:                        16:45:51   Log-Likelihood:                -39.714
converged:                       True   LL-Null:                       -52.188
                                        LLR p-value:                 3.823e-06
================================================================================
                   coef    std err          z      P>|z|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
const           -2.7718      0.561     -4.940      0.000        -3.872    -1.672
OfficeVisits     0.0680      0.031      2.211      0.027         0.008     0.128
Narcotics        0.1223      0.041      2.991      0.003         0.042     0.203
================================================================================
"""
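Why the cast helps can be sketched without statsmodels or the original CSV. The snippet below is only an illustration under assumptions: the small frame is made-up stand-in data, and astype(object) is used to imitate the object-dtype data that the split produced here. NumPy ufuncs such as np.exp cannot operate on an object array holding plain Python ints, which is the kind of failure the traceback reports; the exact exception type depends on the NumPy version.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the predictor columns (values are made up).
df = pd.DataFrame({
    "OfficeVisits": [3, 12, 7, 20, 5, 9],
    "Narcotics": [0, 4, 1, 10, 2, 3],
})

# Imitate the object-dtype frame obtained after the split (assumption).
x_obj = df.astype(object)
print(x_obj.dtypes)  # both columns report dtype "object"

# np.exp looks for an .exp() method on each element of an object array.
# Python ints have none, so the call fails (AttributeError on older NumPy,
# TypeError on newer releases).
try:
    np.exp(x_obj.to_numpy())
    ufunc_failed = False
except (AttributeError, TypeError) as err:
    ufunc_failed = True
    print(type(err).__name__, err)

# Casting to float restores a numeric dtype, and the same call works.
x_float = x_obj.astype(float)
result = np.exp(-x_float.to_numpy())
print(x_float.dtypes)
```

Checking x.dtypes before calling sm.Logit is a quick way to spot this situation.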
Draggle answered 23/3, 2015 at 23:47

Comments (4):
Thanks. But it is strange that it's not capable of fitting to integer data, isn't it? – Maurine
Running the example: train_test_split returns an array of dtype object. The master version of statsmodels now raises an exception if one of the arrays has object dtype. – Weekday
Thanks for the answers here. Quick question @josef: is there now a statsmodels (or pandas) native train/test split function out there? Easy enough to make my own, just curious if there's an "official" one. Thanks! – Gant
statsmodels does not have a train/test split function. AFAIK, neither does pandas. – Weekday
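
For anyone who does want to roll their own split, as the comment above suggests, here is one possible sketch using only pandas. It is not an official helper: it combines DataFrame.sample and DataFrame.drop, assumes the index has no duplicate labels, and uses a made-up frame in place of the real data. Because the result stays a DataFrame, the column dtypes are preserved.

```python
import pandas as pd

# Hypothetical stand-in data (values are made up).
df = pd.DataFrame({
    "OfficeVisits": range(20),
    "PoorCare": [0, 1] * 10,
})

# Draw 75% of the rows at random for training; random_state makes it repeatable.
train = df.sample(frac=0.75, random_state=1)

# Everything not drawn becomes the test set (relies on unique index labels).
test = df.drop(train.index)

print(len(train), len(test))
print(train.dtypes)
```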
