Replicating Stata Probit with robust errors in R
Asked Answered
D

3

9

I am trying to replicate a Stata Output in R. I am using the dataset affairs. I am having trouble replicating the probit function with robust standard errors.

The Stata code looks like that:

probit affair male age yrsmarr kids relig educ ratemarr, r

I've started with:

 probit1 <- glm(affair ~ male + age + yrsmarr + kids + relig + educ + ratemarr, 
           family = binomial (link = "probit"), data = mydata)

Then I've tried various adjustments with the sandwich package, such as:

myProbit <- function(probit1, vcov = sandwich(..., adjust = TRUE)) {
            print(coeftest(probit1, vcov = sandwich(probit1, adjust = TRUE)))
}

Or (with all types HC0 to HC5):

myProbit <- function(probit1, vcov = sandwich) {
            print(coeftest(probit1, vcovHC(probit1, type = "HC0"))  
}

Or this, as was suggested here (do I have to enter something different for object?):

sandwich1 <- function(object, ...) sandwich(object) * nobs(object) / (nobs(object) - 1)
coeftest(probit1, vcov = sandwich1)

None of these attempts led to the same standard errors or z-values from the stata output.

Hoping for some constructive ideas!

Thanks in advance!

Defeatist answered 14/5, 2015 at 11:41 Comment(5)
Take a look at example 5 here and the paragraph right above. As an aside, if you have heteroskedastic errors, this approach consistently estimates the standard errors of biased and inconsistent parameters. Many people think this is a silly thing to do.Fidgety
Maybe you can post the full replication code along with the output? Currently, it is not exactly clear to me which version of the data you have used and what the results in Stata and R are, respectively.Tranquilize
Thanks @Dimitriy V. Masterov for posting your results. So it is not just a factor as from the degrees-of-freedom adjustment. The R/sandwich code is really identical (just using different make.link results), hence I'm a bit surprised that the strategy works for replicating logit but not probit. I'm not sure how this could happen...Tranquilize
@AchimZeileis The sandwich1 method works for the standard errors in the logit-Regression using the same data but I couldn't manage to get the correct chi2 with it. Any ideas on why that may be?Defeatist
If you mean the chi-squared statistic from the Wald test, then you can use the waldtest() function from the lmtest package or the linearHypothesis() function from the car package. Both allow for optional vcov arguments to be plugged in.Tranquilize
F
4

For folks who are considering jumping on this wagon, here is some code demonstrating the problem (data here):

clear
set more off
capture ssc install bcuse
capture ssc install rsource
bcuse affairs

saveold affairs, version(12) replace

rsource, terminator(XXX)
  library("foreign")
  library("lmtest")
  library("sandwich")
  mydata<-read.dta("affairs.dta")
  probit1<-glm(affair ~ male + age + yrsmarr + kids + relig + educ + ratemarr, family = binomial (link = "probit"), data = mydata)
  sandwich1 <- function(object,...) sandwich(object) * nobs(object)/(nobs(object) - 1)
  coeftest(probit1,vcov = sandwich1)
XXX 

probit affair male age yrsmarr kids relig educ ratemarr, robust cformat(%9.6f) nolog

R gives:

z test of coefficients:

             Estimate Std. Error z value  Pr(>|z|)    
(Intercept)  0.764157   0.546692  1.3978 0.1621780    
male         0.188816   0.133260  1.4169 0.1565119    
age         -0.024400   0.011423 -2.1361 0.0326725 *  
yrsmarr      0.054608   0.019025  2.8703 0.0041014 ** 
kids         0.208072   0.168222  1.2369 0.2161261    
relig       -0.186085   0.053968 -3.4480 0.0005647 ***
educ         0.015506   0.026389  0.5876 0.5568012    
ratemarr    -0.272711   0.053668 -5.0814 3.746e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Stata yields:

Probit regression                               Number of obs     =        601
                                                Wald chi2(7)      =      54.93
                                                Prob > chi2       =     0.0000
Log pseudolikelihood =  -305.2525               Pseudo R2         =     0.0961

------------------------------------------------------------------------------
             |               Robust
      affair |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   0.188817   0.131927     1.43   0.152    -0.069755    0.447390
         age |  -0.024400   0.011124    -2.19   0.028    -0.046202   -0.002597
     yrsmarr |   0.054608   0.018963     2.88   0.004     0.017441    0.091775
        kids |   0.208075   0.166243     1.25   0.211    -0.117754    0.533905
       relig |  -0.186085   0.053240    -3.50   0.000    -0.290435   -0.081736
        educ |   0.015505   0.026355     0.59   0.556    -0.036150    0.067161
    ratemarr |  -0.272710   0.053392    -5.11   0.000    -0.377356   -0.168064
       _cons |   0.764160   0.534335     1.43   0.153    -0.283117    1.811437
------------------------------------------------------------------------------

Addendum:

The difference in covariance estimation of coefficients is due to the different fitting algorithms. In R, the glm command uses iterative least-square method, while Stata's probit uses an ML method based on Newton-Raphson algorithm. You can match what R is doing with glm in Stata with irls option:

glm affair male age yrsmarr kids relig educ ratemarr, irls family(binomial) link(probit) robust

This yields:

Generalized linear models                         No. of obs      =        601
Optimization     : MQL Fisher scoring             Residual df     =        593
                   (IRLS EIM)                     Scale parameter =          1
Deviance         =  610.5049916                   (1/df) Deviance =   1.029519
Pearson          =  619.0405832                   (1/df) Pearson  =   1.043913

Variance function: V(u) = u*(1-u)                 [Bernoulli]
Link function    : g(u) = invnorm(u)              [Probit]

                                                  BIC             =  -3183.862

------------------------------------------------------------------------------
             |             Semirobust
      affair |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   0.188817   0.133260     1.42   0.157    -0.072367    0.450002
         age |  -0.024400   0.011422    -2.14   0.033    -0.046787   -0.002012
     yrsmarr |   0.054608   0.019025     2.87   0.004     0.017319    0.091897
        kids |   0.208075   0.168222     1.24   0.216    -0.121634    0.537785
       relig |  -0.186085   0.053968    -3.45   0.001    -0.291862   -0.080309
        educ |   0.015505   0.026389     0.59   0.557    -0.036216    0.067226
    ratemarr |  -0.272710   0.053668    -5.08   0.000    -0.377898   -0.167522
       _cons |   0.764160   0.546693     1.40   0.162    -0.307338    1.835657
------------------------------------------------------------------------------

These will be close, though not identical. I am not sure how to get R to use something like NR without a whole lot of work.

Fidgety answered 14/5, 2015 at 23:23 Comment(1)
Thank you for illustrating it once more! Since I don't have a stata license and only a physical print, I couldn't try to experiment with the data on Stata. It seems as if , r uses different standard errors for probit and logit, but I have only basic knowledge of Stata, so I can't figure it outDefeatist
I
3

I am using matrix approach as described in detail here (p.57) to match the R results with Stata. However, I couldn't match the result exactly yet. I think the small difference might be due to difference in scores. Scores in R match with Stata only up to 4 decimal places .

Stata

clear all
bcuse affairs

probit affair male age yrsmarr kids relig educ ratemarr
mat var_nr=e(V)
predict double u, score
matrix accum s = male age yrsmarr kids relig educ ratemarr [iweight=u^2*601/600] //n=601,n-1=600
matrix rv = var_nr*s*var_nr
mat diagrv=vecdiag(rv)
matmap diagrv rse,m(sqrt(@)) //install matmap 
mat list rse //standard errors

This gives you the same standard errors as:

qui probit affair male age yrsmarr kids relig educ ratemarr,r



rse[1,8]
       affair:    affair:    affair:    affair:    affair:    affair:    affair:    affair:
         male        age    yrsmarr       kids      relig       educ   ratemarr      _cons
r1  .13192707  .01112372  .01896336  .16624258  .05324046  .02635524  .05339163  .53433495

R:

library(AER) # Affairs data
data(Affairs)
mydata<-Affairs
mydata$affairs<-with(mydata,ifelse(affairs>0,1,affairs)) # convert to 1 and 0 
probit1<-glm(affairs ~ gender+ age + yearsmarried + children + religiousness+education + rating,family = binomial(link = "probit"),data = mydata)
u<-subset(estfun(probit1),select="(Intercept)") #scores: perfectly matches to 4 decimals with Stata: difference may be due to this step
w0<-u%*%t(u)*(601/600) #(n/n-1)
iweight<-matrix(0,nrow=601,ncol=601) #perfectly matches to 4 decimals with Stata 
diag(iweight)<-diag(w0) 
x<-model.matrix(probit1)  
s<-t(x)%*%iweight%*%x #doesn't match with Stata : 
rv<-vcov(probit1)%*%s%*%vcov(probit1)
rse<-sqrt(diag(rv)) # standard  errors
   rse
  (Intercept)    gendermale           age  yearsmarried   childrenyes religiousness     education        rating 
   0.54669177    0.13325951    0.01142258    0.01902537    0.16822161    0.05396841    0.02638902    0.05366828 

This matches with:

 sandwich1 <- function(object, ...) sandwich(object) * nobs(object) / (nobs(object) - 1)
coeftest(probit1, vcov = sandwich1) 

Conclusion: The difference in results between R and Stata is due to difference in scores (matches only up to 4 decimal places).

Inweave answered 16/5, 2015 at 3:33 Comment(1)
Interesting insight! Unfortunately I think that is beyond the level of my understanding of R to have any chance to fix it.Defeatist
S
3

To close the circle on this discussion, it is possible to match the original Stata output in R by using sampleSelection::probit for estimation and the sandwich package (I used version 2.5) to compute robust standard errors. The probit function uses maximum likelihood, as does its Stata counterpart.

As in the original post, the Stata code is

probit affair male age yrsmarr kids relig educ ratemarr, robust

which gives

------------------------------------------------------------------------------
             |               Robust
      affair |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .1888175   .1319271     1.43   0.152    -.0697548    .4473898
         age |  -.0243996   .0111237    -2.19   0.028    -.0462017   -.0025975
     yrsmarr |    .054608   .0189634     2.88   0.004     .0174405    .0917755
        kids |   .2080754   .1662426     1.25   0.211     -.117754    .5339049
       relig |  -.1860854   .0532405    -3.50   0.000    -.2904348    -.081736
        educ |   .0155052   .0263552     0.59   0.556    -.0361501    .0671605
    ratemarr |  -.2727101   .0533916    -5.11   0.000    -.3773558   -.1680644
       _cons |     .76416    .534335     1.43   0.153    -.2831173    1.811437
------------------------------------------------------------------------------

The R code which gives the same results is

library(AER)
library(sampleSelection)
data(Affairs)
Affairs$affair = Affairs$affairs > 0
Affairs$male = Affairs$gender == 'male'
reg = probit(affair ~ male + age + yearsmarried + children + religiousness +
           education + rating, data=Affairs)
print(coeftest(reg, vcovCL), digits=6)

This gives

                Estimate Std. Error  t value   Pr(>|t|)    
(Intercept)    0.7641600  0.5343350  1.43011  0.1532109    
maleTRUE       0.1888175  0.1319271  1.43123  0.1528921    
age           -0.0243996  0.0111237 -2.19347  0.0286608 *  
yearsmarried   0.0546080  0.0189634  2.87966  0.0041248 ** 
childrenyes    0.2080755  0.1662426  1.25164  0.2111955    
religiousness -0.1860854  0.0532405 -3.49519  0.0005091 ***
education      0.0155052  0.0263552  0.58832  0.5565446    
rating        -0.2727101  0.0533916 -5.10773 4.4012e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Using these functions, both compute maximum likelihood probit estimates, and both compute robust standard errors. As an aside: Hats off to the authors of the sandwich package, which (IMO) has really cleaned up standard error calculations in R.

Sieve answered 24/8, 2018 at 14:44 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.