Major discrepancies between R and Stata for ARIMA
Using historical Lynx Pelt data (https://www.dropbox.com/s/v0h9oywa4pdjblu/Lynxpelt.csv), here are two tables of AIC values from R and Stata for ARIMA(p,q) models with 0 <= p <= 4 and 0 <= q <= 4. Note that for (p,q) = (0,1), (0,2), (0,3), (1,0), (1,1), (1,2), (2,0), (2,1), (2,2), (2,3), (3,0), (3,1), (3,2), (4,0) and (4,1) the values are identical to seven significant digits. The remaining cases, however, are wildly different; just look at (4,2)! The coefficient estimates also differ substantially wherever the AICs do not match. Is this a bug in the core ARIMA function, or what is going on?

AIC calculations from R for ARIMA(p,q)
          q0        q1       q2       q3       q4
p0 145.25613 100.20123 87.45927 77.57073 85.86376
p1 101.54847  84.91691 82.11806 77.15318 74.26392
p2  63.41165  49.42414 44.14899 40.96787 44.33848
p3  52.26069  49.19660 52.00560 43.50156 45.17175
p4  46.19617  48.19530 49.50422 42.43198 45.71375

R parameter estimates: http://pastie.org/8942238

AIC calculations from Stata for ARIMA(p,q) on log data
           q0        q1        q2        q3        q4
p0          .  100.2012  87.45929  77.57074  83.86378
p1   101.5485  84.91692  82.11809  86.44413  74.26394
p2   63.41167  49.42417  44.14902  40.96633  40.76029
p3   52.26072  49.19663  52.00562  40.37268  42.20399
p4   46.19619  48.19532  40.39699  43.12795        na

Stata parameter estimates: http://pastie.org/8942232

Below is the code for creating the AIC table in R. Note that I force the use of maximum likelihood, disable transformation of parameters, and increase the maximum number of iterations.

pelts <- read.csv("Lynxpelt.csv")
pelts$log <- log(pelts$W7)
models <- array(list(),5)
aic <- data.frame(q0=rep(NA,5), q1=rep(NA,5), q2=rep(NA,5), q3=rep(NA,5), q4=rep(NA,5), row.names=c("p0", "p1", "p2", "p3", "p4"))

makeModel <- function(p,q) {
    arima(pelts$log, order=c(p,0,q), transform.pars=FALSE, method="ML", optim.control=list(maxit=1000))
}

options(warn=1)

for (p in 0:4) {
    for (q in 0:4) {
        model <- makeModel(p,q)
        models[[p+1]][[q+1]] <- model
        aic[p+1,q+1] <- model$aic
        cat("p=", p, ", q=", q, "\n")
    }
}

aic
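(To rule out an arithmetic problem in the AIC itself, the value arima() reports can be reproduced by hand from the log-likelihood. A self-contained sketch on simulated data, assuming the usual convention AIC = -2*logLik + 2*npar, where npar counts the ARMA coefficients, the intercept, and the innovation variance:)

```r
# Verify arima()'s reported AIC against a manual computation.
# Simulated AR(1) data is used so the check is self-contained.
set.seed(1)
x <- arima.sim(model = list(ar = 0.5), n = 200)
fit <- arima(x, order = c(1, 0, 0), method = "ML")

npar <- length(fit$coef) + 1          # coefficients + intercept, +1 for sigma^2
manual_aic <- -2 * fit$loglik + 2 * npar
all.equal(manual_aic, fit$aic)        # the two values should coincide
```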

And here's the code for Stata:

insheet using Lynxpelt.csv
save Lynxpelt, replace

tsset year
tsline w7

gen logw7=log(w7)
label var logw7 "logarithm of w7"

mat A=J(5,5,0) /*This matrix is a 5*5 matrix with 0s*/
mat list A /*show the matrix A*/

forvalues i=0/4 {
forvalues j=0/4 {
set more off
quietly arima logw7, arima(`i',0,`j')
estat ic
matrix list r(S)
matrix s=r(S)
scalar alpha=s[1,5]
mat A[`i'+1,`j'+1]=alpha
}
}


* ARMA(4,4) cannot be done since Stata cannot choose an initial value - we supply one manually *
* I will use the estimates from ARMA(3,4) *
* Let's run ARMA(3,4) again *
quietly arima logw7, ar(1/3) ma(1/4)
matrix list e(b)
mat B=e(b)

*Now, let's run ARMA(4,4) with initial values from ARMA(3,4) *
quietly arima logw7, ar(1/4) ma(1/4) from(B)
estat ic
matrix s=r(S)
scalar alpha=s[1,5]
mat A[5,5]=alpha

Edit: added links to parameter estimates & added line to R code to fix "models not found" error

Edit 2: Following iacobus's advice, I manually forced Stata to use BFGS as the optimization method. The (4,3) and (3,3) values are much improved. Other values still differ wildly; the (3,2), for example, used to match and is now very different.

Stata results with technique(bfgs):
           q0         q1         q2         q3         q4
p0  145.25614  100.20123   87.45929  77.570744  85.863777
p1  101.54848  84.916921   82.11809  86.444131  74.263937
p2  63.411671  49.424167  44.149023  40.966325  42.760294
p3  52.260723  49.196628  40.442078  43.498413  43.622292
p4  46.196192  48.195322  42.396986  42.289595          0

R results from above for easy comparison:

AIC calculations from R for ARIMA(p,q)
          q0        q1       q2       q3       q4
p0 145.25613 100.20123 87.45927 77.57073 85.86376
p1 101.54847  84.91691 82.11806 77.15318 74.26392
p2  63.41165  49.42414 44.14899 40.96787 44.33848
p3  52.26069  49.19660 52.00560 43.50156 45.17175
p4  46.19617  48.19530 49.50422 42.43198 45.71375
Dodecagon answered 16/3, 2014 at 21:39 Comment(6)
I do not use Stata, but perhaps extract the log-likelihood from R for each model and the number of parameters for each model and calculate the AIC yourself. Then check to see if your value for AIC matches the value R reports. That might be a first step.Coadunate
Thanks for the suggestion Mark. The AIC is calculated correctly. Really, the parameter estimates between R & Stata differ thus causing the AICs to differ. I used AIC for the table as it's easier at a quick glance to notice the large discrepancy in regression results for certain p,qDodecagon
@Dodecagon would you mind adding a parameter estimates comparison? I do not have access to Stata at the moment.Abisha
Please make the example reproducible. I get Error in models[[p + 1]][[q + 1]] <- model (from #4) : object 'models' not found when trying to replicate.Sucrose
@user12202013: added links to parameter estimates. Do a control-f with a particular aic in each link to see comparison.Dodecagon
@Ista: added line to R code to define models. Should run fine nowDodecagon
I think your data is producing a numerically unstable likelihood function, especially for the higher-order models. The fact that R (at least for me) gives warnings on some of the higher-order models, and that you have trouble fitting them with unrestricted MLE in Stata, suggests that there may be numerical issues. SAS is also giving me convergence warnings left and right.

If there are numerical issues with the likelihood, this could play into the optimization step. By default, Stata appears to use 5 steps using the Berndt-Hall-Hall-Hausman algorithm followed by 10 steps using BFGS, repeating the combination as required until convergence. R, on the other hand, defaults to using BFGS. You can change that with the optim.method argument but R does not have easy support for using BHHH or moving between BHHH and BFGS like Stata does.

Playing with your data with various different optimizers in R suggests that the AIC that results varies by a decent amount by changing between optimizers. I suspect that this is the cause of the difference between Stata and R's estimates.
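The effect is easy to reproduce on the R side, since arima() exposes the optimizer through its optim.method argument. A sketch on simulated data (the Lynxpelt series isn't bundled here, so the model order and data are illustrative):

```r
# Refit the same ARMA(2,1) model under several optim() methods and compare
# the resulting AICs. On a well-behaved likelihood they agree closely; on a
# flat or unstable one they need not.
set.seed(1)
x <- arima.sim(model = list(ar = c(0.6, -0.3), ma = 0.4), n = 200)

methods <- c("BFGS", "Nelder-Mead", "CG")
fits <- lapply(methods, function(m)
  arima(x, order = c(2, 0, 1), method = "ML",
        optim.method = m, optim.control = list(maxit = 1000)))

setNames(sapply(fits, `[[`, "aic"), methods)  # one AIC per optimizer
```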

I suggest going to Stata and setting the maximization option technique(bfgs) (see http://www.stata.com/help.cgi?arima#maximize_options for details on how to do that). I wouldn't be surprised if the Stata estimates converged to those from R after making that change.
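On the R side, sensitivity to starting values can be probed directly, since arima() accepts an init vector of initial parameter values. A minimal sketch on simulated data (the starting values here are arbitrary, chosen only for illustration):

```r
# Fit the same ARMA(2,1) model from the default start and from a manual
# start; if the two fits reach different log-likelihoods, the surface has
# multiple local optima and the optimizer's starting point matters.
set.seed(1)
x <- arima.sim(model = list(ar = c(0.6, -0.3), ma = 0.4), n = 200)

f1 <- arima(x, order = c(2, 0, 1), method = "ML")            # default start
f2 <- arima(x, order = c(2, 0, 1), method = "ML",
            init = c(0.2, -0.1, 0.1, 0))                     # ar1, ar2, ma1, mean

c(default = f1$aic, manual = f2$aic)   # compare the optima reached
```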

Lysozyme answered 17/3, 2014 at 7:40 Comment(3)
This is very useful. It would be nice to (1) visualize the likelihood surface and (2) figure out which answer is actually closest to correct -- i.e. does BFGS or BHHH (or something else) give the best answer? [It's easy to compare likelihoods/AIC values across fits within the same platform/package, but may be tricky to compare across ...] It might be possible to hack this a bit more by debugging through arma and extracting the armafn function for further exploration ...Liggins
Insightful, thank you. New results above in Edit 2. The (4,3) & (3,3) are much improved. Other values still differ wildly. The (3,2) for example used to match and now is very different.Dodecagon
That some of the estimates converged when you changed the method suggests that there are numerical issues with the optimization. That some of the values still don't agree is interesting. I suspect it is still due to numerical instability in the likelihood function for this data. If the function is flat or not numerically stable, the result can depend heavily on the starting values and other optimizer settings. I don't know enough about Stata to see how those compare to R's defaults, but I would bet that is the problem.Lysozyme
