ideal() in R package pscl not producing repeatable results
Asked Answered
P

3

7

I'm working with the pscl package in R and trying to get it to produce testable/reproducible results. I've taken a look at the underlying C code and it appears as though GetRNGstate() and PutRNGstate() are being called in the right places but it seems impossible to repeat output from the MCMC model.

I've packaged up the functions in simulationResult from the SoDA package so I can verify the start state of each simulation R on the R side.

library(pscl)
library(SoDA)
run1 <- simulationResult(
  ideal(s109, 
    normalize=TRUE,
    maxiter = 500,
    thin = 10,
    burnin = 0),
  seed = 42)

run2 <- simulationResult(
  ideal(s109, 
    normalize=TRUE,
    maxiter = 500,
    thin = 10,
    burnin = 0),
  seed = 42)

We can verify that the start states are the same at least on the R side:

all.equal(run1@firstState, run2@firstState)

But the output is different:

all.equal(run1@result$xbar, run2@result$xbar)

I can increase the number of iterations but that shouldn't really matter if the RNG state is getting propogated. Am I missing something really simple? Thanks.

Edit: I should also note that all.equal(run1@lastState, run2@lastState) (the end state of each run) should be the same but they end up different. My guess is that some source of contingency outside of the R RNG functions called by C is impacting the number of times those RNG functions are called. Curious.

Edit2

I should also add I'm on R 3.0.1 with pscl 1.04.4 on OS X 10.8.4.

Pulpy answered 7/6, 2013 at 16:43 Comment(0)
K
7

As suspected by OP and @SchaunW, the problem lies in the C code. "A bit" of digging revealed a quite subtle problem (see the source code, not the newest version though):

All the sampling in ideal.c appears in the part of commence iterations, i.e. where functions updatex, updatey and others are used. However, the problem is with one of the arguments of these functions - the matrix ok (ironic, right?). It is used by updatex and updateb and only the positions where ok == 1 are important (in crosscheck, crosscheckx).

Before that, some values of ok are assigned to be 1 in check(y,ok,n,m).

However, at the very beginning the initial values of ok are denoted by

ok = imatrix(n,m);

which allocates a integer matrix (see util.c for imatrix). The problem is that then ok contains various numbers, i.e. not only zeros and sometimes ones. It seems that they are not related to RNG state of R, which explains the behaviour noted by @SchaunW: all.equal(run1@result$xbar, run2@result$xbar) returns TRUE if !any(ok == 1) and vice versa. Also, different number of ones explains different lastState.

I am not an expert in C and I am not sure whether there is a logic error in the code or if the imatrix function should be corrected, but a straightforward fix could be to fill ok with zeros right after initialisation:

ok = imatrix(n,m);
for(a=0; a<n; a++) {
    for(aa=0; aa<m; aa++) {
      ok[a][aa] = 0;
    }
}

Finally, there is also a fix that does not include modifying the C code (it might not be right for your applications though). Functions crossxyi, crossxyj are used instead of crosscheck, crosscheckx (the bad ones) when impute = TRUE for ideal.

Kunz answered 15/6, 2013 at 13:44 Comment(4)
@Adam Hyland, let me know if you want any more details.Kunz
I'll take a look at this tonight. I talked to a co-worker of mine and he noted that IDEAL.c is also calling (via dtnorm) the R rng functions (e.g. exp_rand) a variable number of times (which helps explain the end RNG state being different). I'll see if this impacts those calls or if there are 2 problems. :) Thanks.Pulpy
How does it come that those functions are called a variable number of times? I would guess that it should not be a problem since this fix of filling ok with zeros also returns constant lastState.Kunz
Bingo. This works, or at least it provides the correct result. I'll investigate the other issues with the package on my own time. Thanks!Pulpy
A
3

EDIT

I have not been able to reproduce the results I originally posted. When I got those results the first time, I closed out R, restarted it, and ran the whole thing again just to make sure, and I got the same results again. What appears below is copied exactly from my R console. However, I just tried the code a third (and fourth and fifth) time and it is not working. I'm leaving my original answer up just in case I was onto something and didn't realize it and it can be of use to someone else, but the advice below does not appear to work (at least not consistently).

The problem does seem to lie in the C code. When I open up the ideal function and run it line by line, all.equal returns TRUE for every input in this line of code:

output <- .C("IDEAL", PACKAGE = .package.Name, as.integer(n), 
      as.integer(m), as.integer(d), as.double(yToC), as.integer(maxiter), 
      as.integer(thin), as.integer(impute), as.integer(mda), 
      as.double(xp), as.double(xpv), as.double(bp), as.double(bpv), 
      as.double(xstart), as.double(bstart), xoutput = as.double(rep(0, 
        n * d * numrec)), boutput = as.double(0), as.integer(burnin), 
      as.integer(usefile), as.integer(store.item), as.character(file), 
      as.integer(verbose))

However, when I run the above code multiple times, output$xoutput returns slightly different results each time, even if I call set.seed(42) immediately before each run.

sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grid      splines   stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] SoDA_1.0-5       pscl_1.04.4      vcd_1.2-13       colorspace_1.2-0 gam_1.06.2       coda_0.16-1      lattice_0.20-10  mvtnorm_0.9-9994
[9] MASS_7.3-22     

loaded via a namespace (and not attached):
[1] tools_2.15.2

ORIGINAL ANSWER

The ideal function has a startvals argument. The default for that argument is "eigen". For your call to set.seed to have an effect, you need to change that argument to "random". Here's what you already tried:

run1 <- simulationResult(
   ideal(s109, 
     normalize=TRUE,
     maxiter = 500,
     thin = 10,
     burnin = 0,
     startvals = "eigen"),
   seed = 42)

run2 <- simulationResult(
   ideal(s109, 
     normalize=TRUE,
     maxiter = 500,
     thin = 10,
     burnin = 0,
     startvals = "eigen"),
   seed = 42)

all.equal(run1@firstState, run2@firstState)
[1] TRUE

all.equal(run1@result$xbar, run2@result$xbar)
[1] "Mean relative difference: 0.01832379"

And here's the same thing with startvals set to "random":

run1 <- simulationResult(
   ideal(s109, 
     normalize=TRUE,
     maxiter = 500,
     thin = 10,
     burnin = 0,
     startvals = "random"),
   seed = 42)

run2 <- simulationResult(
   ideal(s109, 
     normalize=TRUE,
     maxiter = 500,
     thin = 10,
     burnin = 0,
     startvals = "random"),
   seed = 42)

all.equal(run1@firstState, run2@firstState)
[1] TRUE    

all.equal(run1@result$xbar, run2@result$xbar)
[1] TRUE

As far as I can see, the need to set startvals to "random" in order to get replicable results is not clearly indicated in the package documentation. I had to play around with it for a while before I figured it out.

Amylo answered 12/6, 2013 at 20:44 Comment(2)
I just tried this out and I'm still seeing a difference in the results. I'll try and provide some more information because I'm curious about this. What is your system/version/etc. ?Pulpy
It consistently gave the right results earlier today, but now it's not. I have no idea what changed within the space of a few hours. See edits above.Amylo
D
1

It's an MCMC model so it necessarily uses random number generation. To get repeatable results, you need to start your analysis by setting a 'seed' for the random number generator. This way each time you build the model, it uses the same "random" numbers (as long as you reset the seed each time you build the model. use the set.seed() function and just feed it an arbitrary value, like 1234.

I'm not familiar with this package and it looks like you might already be setting a seed for random number generation in your function call with seed=42, but I'd recommend setting it explicitly with set.seed() anyway. Your code then becomes:

set.seed(1234)
run1 <- simulationResult(
  ideal(s109, 
    normalize=TRUE,
    maxiter = 500,
    thin = 10,
    burnin = 0),
  seed = 42)

set.seed(1234)
run2 <- simulationResult(
  ideal(s109, 
    normalize=TRUE,
    maxiter = 500,
    thin = 10,
    burnin = 0),
  seed = 42)
Disarm answered 7/6, 2013 at 17:6 Comment(2)
simulationResult should set the seed in the same way as set.seed(), I think. if you do your answer you'll see they still aren't equal all.equal(run1@result$xbar, run2@result$xbar) (either w/ the seed argument set as you did or with it removed so the seed is 1234.Pulpy
Weird. That's all I got. Try contacting the package maintainer: Simon Jackman <jackman at stanford.edu>Disarm

© 2022 - 2024 — McMap. All rights reserved.