Multinomial Logit model Python and Stata different results

ses_type prog_type read write math prog ses 0 low Diploma 39.2 40.2 46.2 0 0 1 middle general 39.2 38.2 46.2 1 1 2 high Diploma 44.5 44.5 49.5 0 2 3 low Diploma 43.0 43.0 48.0 0 0 4 middle Diploma 44.5 36.5 45.5 0 1 5 high general 47.3 41.3 47.3 1 2

Iteration 0: log likelihood = -204.09667 Iteration 1: log likelihood = -171.90258 Iteration 2: log likelihood = -170.13513 Iteration 3: log likelihood = -170.11071 Iteration 4: log likelihood = -170.1107 Multinomial logistic regression Number of obs = 200 LR chi2(10) = 67.97 Prob > chi2 = 0.0000 Log likelihood = -170.1107 Pseudo R2 = 0.1665 ------------------------------------------------------------------------------ prog | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- 0 | ses | 1 | .6197969 .5059335 1.23 0.221 -.3718146 1.611408 2 | -.5131952 .6280601 -0.82 0.414 -1.74417 .7177799 | read | -.0405302 .0289314 -1.40 0.161 -.0972346 .0161742 write | -.0459711 .0270153 -1.70 0.089 -.09892 .0069779 math | -.0990497 .0331576 -2.99 0.003 -.1640373 -.0340621 _cons | 9.544131 1.738404 5.49 0.000 6.136921 12.95134 -------------+---------------------------------------------------------------- 1 | ses | 1 | -.3350861 .4607246 -0.73 0.467 -1.23809 .5679176 2 | -.8687013 .5363968 -1.62 0.105 -1.92002 .182617 | read | -.0226249 .0264534 -0.86 0.392 -.0744726 .0292228 write | -.011618 .0266782 -0.44 0.663 -.0639063 .0406703 math | -.0591301 .0299996 -1.97 0.049 -.1179283 -.000332 _cons | 5.041193 1.524174 3.31 0.001 2.053866 8.028519 -------------+---------------------------------------------------------------- 2 | (base outcome) ------------------------------------------------------------------------------

data = pd.read_csv("C://Users/Furqan/Desktop/random_data.csv") train_x = np.array(data[['read', 'write', 'math','ses ']]) train_y = np.array(data['prog']) mul_lr = linear_model.LogisticRegression(multi_class='multinomial', solver='newton-cg').fit(train_x, train_y) print(mul_lr.intercept_) print(mul_lr.coef_)

There are several issues that make Stata and sklearn results differ:

Different actual predictors in Stata and sklearn
Different representations of fitted parameters
Different goal functions when fitting the model

We need to change all the three conditions to achieve similar outputs.

1. Making the dummy variables

The formula that Stata uses for the linear part is

 prediction = a0 + a1 * [ses==1] + a2 * [ses==2] + a3 * read + a4 * write + a5 * math

Sklearn, in turn, knows nothing about the categorical nature of ses, and tries to use

 prediction = a0 + a1 * ses + a3 * read + a4 * write + a5 * math

To enable categorical predictions, you need to preprocess the data. This is the only possible way to include categorical variables into an sklearn logistic regression. I find pd.get_dummies() the most convenient way to do this.

The following code creates the dummy variable for ses and then drops the "low" level, which evidently corresponds to ses=0 in your example:

import pandas as pd, numpy as np
from sklearn import linear_model

data = pd.read_csv("d1.csv", sep='\t')
data.columns = data.columns.str.strip()

raw_x = data.drop('prog', axis=1)
# making the dummies
train_x = pd.get_dummies(raw_x, columns=['ses']).drop('ses_low ', axis=1)
print(train_x.columns)
train_y = data['prog']

mul_lr = linear_model.LogisticRegression(multi_class='multinomial',
                                         solver='newton-cg').fit(train_x, train_y)
reorder = [4, 3, 0, 1, 2] # the order in which coefficents show up in Stata

print(mul_lr.intercept_)
print(mul_lr.coef_[:, reorder])

It outputs

['read', 'write', 'math', 'ses_high ', 'ses_middle ']
[ 4.67331919  0.19082335 -4.86414254]
[[ 0.47140512 -0.08236331 -0.01909793 -0.02680609 -0.04587383]
 [-0.36381476 -0.33294749 -0.0021255   0.00765828 -0.00703075]
 [-0.10759035  0.4153108   0.02122343  0.01914781  0.05290458]]

You see that Python has successfully encoded sess into 'ses_high ' and 'ses_middle ', but failed to produce the expected coefficients.

By the way, I have changed the order of coef_ columns in the output, to have it look like in Stata.

2. Rearranging the results

That's because Stata views the third category of the outcome (prog=='honors ') as base outcome, and subtracts all its parameters from the rest of the parameters. In Python, you can reproduce this by running

print(mul_lr.intercept_ - mul_lr.intercept_[-1])
print((mul_lr.coef_  - mul_lr.coef_[-1])[:, reorder])

which gives you

[9.53746174 5.0549659  0.        ]
[[ 0.57899547 -0.4976741  -0.04032136 -0.0459539  -0.09877841]
 [-0.25622441 -0.74825829 -0.02334893 -0.01148954 -0.05993533]
 [ 0.          0.          0.          0.          0.        ]]

You can see now that the parameters now are close to what Stata gives:

intercepts of (9.53, 5.05) in Python vs (9.54, 5.04) in Stata
first-outcome coefficients (0.57, -0.49, ...) vs (0.61, -0.51, ...)
second-outcome coefficients (-0.25, -0.74, ...) vs (-0.33, -0.86, ...)

Can you see the pattern? In sklearn, slope coefficients are smaller (closer to zero) than in Stata. This is not an accident!

3. Dealing with regularization

This happens because sklearn intentionally shrinks the slope coefficients towards 0, by adding a quadratic penalty on coefficients to the likelihood function it maximizes. This makes the estimates biased but more stable, even in case of harsh multicollinearity. In Bayesian terms, such regularization corresponds to a zero-mean Gaussian prior on all the coefficients. You can learn more about regularization in the wiki.

In sklearn, this quadratic penalty is controlled by the positive C parameter: the smaller it is, the more regularization you get. You can think of it as the prior variance of each slope coefficient. The default value is C=1, but you can make it larger, like C=1000000, which means almost no regularization. In this case, the output is almost identical to that of Stata:

mul_lr2 = linear_model.LogisticRegression(
    multi_class='multinomial', solver='newton-cg', C=1000000
).fit(train_x, train_y)
print(mul_lr2.intercept_ - mul_lr2.intercept_[-1])
print((mul_lr2.coef_  - mul_lr2.coef_[-1])[:, reorder])

which gives you

[9.54412644 5.04126452 0.        ]
[[ 0.61978951 -0.51320481 -0.04053013 -0.0459711  -0.09904948]
 [-0.33508605 -0.86869799 -0.02262518 -0.01161839 -0.05913068]
 [ 0.          0.          0.          0.          0.        ]]

The results still differ a little (like in the 5-th decimal), but with even less regularization the difference fill shrink further.

1. Making the dummy variables

2. Rearranging the results

3. Dealing with regularization

Recommended topics

Hot tags