how to incorporate C or C++ code into my R code to speed up a MCMC program, using a Metropolis-Hastings algorithm

Asked 10/2, 2012 at 11:36 Answered 24/5, 2012 at 13:20

I am seeking advice on how to incorporate C or C++ code into my R code to speed up a MCMC program, using a Metropolis-Hastings algorithm. I am using an MCMC approach to model the likelihood, given various covariates, that an individual will be assigned a particular rank in a social status hierarchy by a 3rd party (the judge): each judge (approx 80, across 4 villages) was asked to rank a group of individuals (approx 80, across 4 villages) based on their assessment of each individual's social status. Therefore, for each judge I have a vector of ranks corresponding to their judgement of each individual's position in the hierarchy.

To model this I assume that, when assigning ranks, judges are basing their decisions on the relative value of some latent measure of an individual's utility, u. Given this, it can then be assumed that a vector of ranks, r, produced by a given judge is a function of an unobserved vector, u, describing the utility of the individuals being ranked, where the individual with the kth highest value of u will be assigned the kth rank. I model u, using the covariates of interest, as a multivariate normally distributed variable and then determine the likelihood of the observed ranks, given the distribution of u generated by the model.

In addition to estimating the effect of, at most, 5 covariates, I also estimate hyperparameters describing variance between judges and items. Therefore, for every iteration of the chain I estimate a multivariate normal density approximately 8-10 times. As a result, 5000 iterations can take up to 14 hours. Obviously, I need to run it for much more than 5000 runs and so I need a means for dramatically speeding up the process. Given this, my questions are as follows:

(i) Am I right to assume that the best speed gains will be had by running some, if not all of my chain in C or C++?

(ii) assuming the answer to question 1 is yes, how do I go about this? For example, is there a way for me to retain all my R functions, but simply do the looping in C or C++: i.e. can I call my R functions from C and then do looping?

(iii) I guess what I really want to know is how best to approach the incorporation of C or C++ code into my program.

Stomatic answered 10/2, 2012 at 11:36 Comment(1)

You should also make sure your MCMC algorithm is working well; it's sometimes the case that choosing a more appropriate algorithm or choosing better mixing parameters can result in far larger speed gains than anything else. That is, don't just use Metropolis-Hastings blindly, make sure it's a good choice for you. – Euratom 10/2, 2012 at 16:36

First make sure your slow R version is correct. Debugging R code might be easier than debugging C code. Done that? Great. You now have correct code you can compare against.

Next, find out what is taking the time. Use Rprof to run your code and see what is taking the time. I did this for some code I inherited once, and discovered it was spending 90% of the time in the t() function. This was because the programmer had a matrix, A, and was doing t(A) in a zillion places. I did one tA=t(A) at the start, and replaced every t(A) with tA. Massive speedup for no effort. Profile your code first.

Now, you've found your bottleneck. Is it code you can speed up in R? Is it a loop that you can vectorise? Do that. Check your results against your gold standard correct code. Always. Yes, I know its hard to compare algorithms that rely on random numbers, so set the seeds the same and try again.

Still not fast enough? Okay, now maybe you need to rewrite parts (the lowest level parts, generally, and those that were taking the most time in the profiling) in C or C++ or Fortran, or if you are really going for it, in GPU code.

Again, really check the code is giving the same answers as the correct R code. Really check it. If at this stage you find any bugs anywhere in the general method, fix them in what you thought was the correct R code and in your latest version, and rerun all your tests. Build lots of automatic tests. Run them often.

Read up about code refactoring. It's called refactoring because if you tell your boss you are rewriting your code, he or she will say 'why didn't you write it correctly first time?'. If you say you are refactoring your code, they'll say "hmmm... good". THIS ACTUALLY HAPPENS.

As others have said, Rcpp is made of win.

Thermobarograph answered 10/2, 2012 at 13:11 Comment(3)

Can I get that last line stitched on a pillow, please? Pretty please? – Zane 10/2, 2012 at 16:45

Yup: cafepress.co.uk/cp/customize/… – Thermobarograph 10/2, 2012 at 16:50

Also see "Speeding Up Ecological and Evolutionary Computations in R; Essentials of High Performance Computing for Biologists" – Heteromerous 22/4, 2015 at 15:41

A complete example using R, C++ and Rcpp is provided by this blog post which was inspired by a this post on Darren Wilkinson's blog (and he has more follow-ups). The example is also included with recent releases of Rcpp in a directory RcppGibbs and should get you going.

Zane answered 10/2, 2012 at 17:35 Comment(0)

I have a blog post which discusses exactly this topic which I suggest you take a look at:

http://darrenjw.wordpress.com/2011/07/31/faster-gibbs-sampling-mcmc-from-within-r/

(this post is more relevant than the post of mine that Dirk refers to).

Donkey answered 10/2, 2012 at 18:39 Comment(0)

I think the best method currently to integrate C or C++ is the Rcpp package of Dirk Eddelbuettel. You can find a lot of information at his website. There is also a talk at Google that is available through youtube that might be interesting.

Nowadays answered 10/2, 2012 at 12:16 Comment(1)

That makes life for the OP even easier! – Nowadays 10/2, 2012 at 16:47

Check out this project: https://github.com/armstrtw/rcppbugs

Also, here is a link to the R/Fin 2012 talk: https://github.com/downloads/armstrtw/rcppbugs/rcppbugs.pdf

Sepaloid answered 24/5, 2012 at 13:20 Comment(0)

I would suggest to benchmark each step of the MCMC sampler and identify the bottleneck. If you put each full conditional or M-H-step into a function, you can use the R compiler package which might give you 5%-10% speed gain. The next step is to use RCPP.

I think it would be really nice to have a general-purpose RCPP function which generates just one single draw using the M-H algorithm given a likelihood function.

However, with RCPP some things become difficult if you only know the R language: non-standard random distributions (especially truncated ones) and using arrays. You have to think more like a C programmer there.

Multivariate Normal is actually a big issue in R. Dmvnorm is very inefficient and slow. Dmnorm is faster, but it would give me NaNs quicker than dmvnorm in some models.

Neither does take an array of covariance matrices, so it is impossible to vectorize code in many instances. As long as you have a common covariance and means, however, you can vectorize, which is the R-ish strategy to speed up (and which is the oppositve of what you would do in C).

Meandrous answered 12/2, 2012 at 1:33 Comment(0)

Recommended topics

Hot tags