I am trying to build a multinomial logit model using Python and Stata. My data is as follows:
   ses_type prog_type  read  write  math  prog  ses
0       low   Diploma  39.2   40.2  46.2     0    0
1    middle   general  39.2   38.2  46.2     1    1
2      high   Diploma  44.5   44.5  49.5     0    2
3       low   Diploma  43.0   43.0  48.0     0    0
4    middle   Diploma  44.5   36.5  45.5     0    1
5      high   general  47.3   41.3  47.3     1    2
I am trying to predict prog using ses, read, write, and math. Here ses represents socioeconomic status and is a nominal variable, so I fit the model in Stata with the following command:
mlogit prog i.ses read write math, base(2)
Stata output is as follows:
Iteration 0: log likelihood = -204.09667
Iteration 1: log likelihood = -171.90258
Iteration 2: log likelihood = -170.13513
Iteration 3: log likelihood = -170.11071
Iteration 4: log likelihood = -170.1107
Multinomial logistic regression Number of obs = 200
LR chi2(10) = 67.97
Prob > chi2 = 0.0000
Log likelihood = -170.1107 Pseudo R2 = 0.1665
------------------------------------------------------------------------------
prog | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
0 |
ses |
1 | .6197969 .5059335 1.23 0.221 -.3718146 1.611408
2 | -.5131952 .6280601 -0.82 0.414 -1.74417 .7177799
|
read | -.0405302 .0289314 -1.40 0.161 -.0972346 .0161742
write | -.0459711 .0270153 -1.70 0.089 -.09892 .0069779
math | -.0990497 .0331576 -2.99 0.003 -.1640373 -.0340621
_cons | 9.544131 1.738404 5.49 0.000 6.136921 12.95134
-------------+----------------------------------------------------------------
1 |
ses |
1 | -.3350861 .4607246 -0.73 0.467 -1.23809 .5679176
2 | -.8687013 .5363968 -1.62 0.105 -1.92002 .182617
|
read | -.0226249 .0264534 -0.86 0.392 -.0744726 .0292228
write | -.011618 .0266782 -0.44 0.663 -.0639063 .0406703
math | -.0591301 .0299996 -1.97 0.049 -.1179283 -.000332
_cons | 5.041193 1.524174 3.31 0.001 2.053866 8.028519
-------------+----------------------------------------------------------------
2 | (base outcome)
------------------------------------------------------------------------------
I tried to replicate the same results using the scikit-learn module in Python. The code is as follows:
import numpy as np
import pandas as pd
from sklearn import linear_model

data = pd.read_csv("C://Users/Furqan/Desktop/random_data.csv")
train_x = np.array(data[['read', 'write', 'math','ses ']])
train_y = np.array(data['prog'])
mul_lr = linear_model.LogisticRegression(multi_class='multinomial',
                                         solver='newton-cg').fit(train_x, train_y)
print(mul_lr.intercept_)
print(mul_lr.coef_)
The output values (intercepts and coefficients) are as follows:
[ 4.76438772 0.19347405 -4.95786177]
[[-0.01735513 -0.02731273 -0.04463257 0.01721334]
[-0.00319366 0.00783135 -0.00689664 -0.24480926]
[ 0.02054879 0.01948137 0.05152921 0.22759592]]
The values turn out to be different.
My first question is: why do the results differ?
My second question is: for a nominal predictor variable, how can I tell Python that ses is an indicator variable?
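One thing that may account for part of the gap, offered as a sketch rather than a definitive answer: Stata reports coefficients relative to the base outcome (its coefficients are fixed at zero), while scikit-learn's multinomial fit prints one coefficient row per class, so only the differences between rows are comparable. The snippet below subtracts the class-2 row from the scikit-learn numbers printed above to put them on the same footing as base(2). The helper name rebase is made up for this illustration. An exact match should not be expected, because LogisticRegression applies an L2 penalty by default and ses was entered here as a single numeric column rather than as indicators.

```python
# Intercepts and coefficients copied from the scikit-learn output above.
# Rows are classes 0, 1, 2; columns are read, write, math, ses.
intercepts = [4.76438772, 0.19347405, -4.95786177]
coefs = [
    [-0.01735513, -0.02731273, -0.04463257, 0.01721334],
    [-0.00319366, 0.00783135, -0.00689664, -0.24480926],
    [0.02054879, 0.01948137, 0.05152921, 0.22759592],
]

BASE = 2  # same base outcome as Stata's base(2)


def rebase(intercepts, coefs, base):
    """Express every class relative to `base` by subtracting the base row.

    `rebase` is a hypothetical helper written for this example only.
    """
    new_intercepts = [b - intercepts[base] for b in intercepts]
    new_coefs = [
        [w - w_base for w, w_base in zip(row, coefs[base])]
        for row in coefs
    ]
    return new_intercepts, new_coefs


rebased_intercepts, rebased_coefs = rebase(intercepts, coefs, BASE)

names = ["read", "write", "math", "ses"]
for k, (b, row) in enumerate(zip(rebased_intercepts, rebased_coefs)):
    if k == BASE:
        continue  # the base outcome is all zeros by construction
    print(f"prog = {k} (relative to prog = {BASE})")
    for name, w in zip(names, row):
        print(f"  {name:>5}: {w: .4f}")
    print(f"  _cons: {b: .4f}")
```

After re-basing, the intercept for prog = 0 comes out at about 9.72 (Stata: 9.54) and the read coefficient at about -0.038 (Stata: -0.041), so the read, write, and math estimates are already close. The ses column is not comparable, since Stata estimated two indicator coefficients and this fit estimated a single slope.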
EDIT:
Link to data file
ses==1 and ses==2 if you want to mimic the Stata output. See, e.g., pandas.pydata.org/pandas-docs/stable/generated/… – Domenic
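Following the comment above, here is a minimal sketch of building 0/1 columns for ses==1 and ses==2 with pandas.get_dummies, leaving ses==0 as the omitted level as in the Stata table. The small frame below only re-types the six sample rows shown in the question so that the snippet runs on its own; the column names ses_1 and ses_2 come from the prefix argument and are a choice made for this example.

```python
import pandas as pd

# The six sample rows from the question, re-typed so the snippet is self-contained.
data = pd.DataFrame({
    "read": [39.2, 39.2, 44.5, 43.0, 44.5, 47.3],
    "write": [40.2, 38.2, 44.5, 43.0, 36.5, 41.3],
    "math": [46.2, 46.2, 49.5, 48.0, 45.5, 47.3],
    "prog": [0, 1, 0, 0, 0, 1],
    "ses": [0, 1, 2, 0, 1, 2],
})

# One indicator column per ses level; drop_first=True omits the lowest
# level (ses == 0), which is the level Stata's i.ses leaves out above.
ses_dummies = pd.get_dummies(data["ses"], prefix="ses", drop_first=True).astype(int)

# Replace the numeric ses column with the two indicator columns.
train_x = pd.concat([data[["read", "write", "math"]], ses_dummies], axis=1)
train_y = data["prog"]

print(train_x)
```

The resulting train_x can be passed to LogisticRegression in place of the original array. To get closer to Stata's unpenalized estimates, it would probably also help to weaken the default L2 penalty, for example by setting a very large C.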