Elementwise matrix multiplication: R versus Rcpp (How to speed this code up?)

Asked 24/7, 2014 at 12:9 Answered 24/7, 2014 at 21:19

I am new to C++ programming (using Rcpp for seamless integration into R), and I would appreciate some advice on how to speed up some calculations.

Consider the following example:

 testmat <- matrix(1:9, nrow=3)
 testvec <- 1:3

 testmat*testvec
   #      [,1] [,2] [,3]
   #[1,]    1    4    7
   #[2,]    4   10   16
   #[3,]    9   18   27

Here, R recycled testvec so that, loosely speaking, testvec "became" a matrix of the same dimensions as testmat for the purpose of this multiplication. Then the Hadamard product is returned. I wish to implement this behavior using Rcpp, that is I want that each element of the i-th row in the matrix testmat is multiplied with the i-th element of the vector testvec. My benchmarks tell me that my implementations are extremely slow, and I would appreciate advise on how to speed this up. Here my code:

First, using Eigen:

#include <RcppEigen.h>
// [[Rcpp::depends(RcppEigen)]]

using namespace Rcpp;
using namespace Eigen;

// [[Rcpp::export]]
NumericMatrix E_matvecprod_elwise(NumericMatrix Xs, NumericVector ys){
  Map<MatrixXd> X(as<Map<MatrixXd> >(Xs));
  Map<VectorXd> y(as<Map<VectorXd> >(ys));

  int k = X.cols();
  int n = X.rows();
  MatrixXd Y(n,k) ;

  // here, I emulate R's recycling. I did not find an easier way of doing this. Any hint appreciated.      
  for(int i = 0; i < k; ++i) {
   Y.col(i) = y;
  }

  MatrixXd out = X.cwiseProduct(Y);
  return wrap(out);
}

Here my implementation using Armadillo (adjusted to follow Dirk's example, see answer below):

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

using namespace Rcpp;
using namespace arma;

// [[Rcpp::export]]
arma::mat A_matvecprod_elwise(const arma::mat & X, const arma::vec & y){
  int k = X.n_cols ;
  arma::mat Y = repmat(y, 1, k) ;  // 
  arma::mat out = X % Y;  
return out;
}

Benchmarking these solutions using R, Eigen or Armadillo shows that both Eigen and Armadillo are about 2 times slower than R. Is there a way to speed these computations up or to get at least as fast as R? Are there more elegant ways of setting this up? Any advise is appreciated and welcome. (I also encourage tangential remarks about programming style in general as I am new to Rcpp / C++.)

Here some reproducable benchmarks:

 # for comparison, define R function:
 R_matvecprod_elwise <- function(mat, vec) mat*vec

n <- 50000
k <- 50
X <- matrix(rnorm(n*k), nrow=n)
e <- rnorm(n)

benchmark(R_matvecprod_elwise(X, e), A2_matvecprod_elwise(X, e), E_matvecprod_elwise(X,e),
          columns = c("test", "replications", "elapsed", "relative"), order = "relative", replications = 1000)

This yields

                  test replications elapsed relative
1  R_matvecprod_elwise(X, e)         1000   10.89    1.000
2  A_matvecprod_elwise(X, e)         1000   26.87    2.467
3  E_matvecprod_elwise(X, e)         1000   27.73    2.546

As you can see, my Rcpp-solutions perform quite miserably. Any way to do it better?

Valarievalda answered 24/7, 2014 at 12:9 Comment(6)

You could do X = X.cwiseProduct(y.replicate(1, X.cols())); in Eigen and then return Xs; but with a 1e3*1e3 matrix it didn't improve the benchmarks over your Eigen function on my system. – Land 24/7, 2014 at 15:18

In Armadillo, try the .each_row() function: arma.sourceforge.net/docs.html#each_colrow Something along these lines: X.each_row() %= y – Emsmus 24/7, 2014 at 18:15

Looking closer at your code, it's probably more correct to use .each_col() instead: X.each_col() %= y – Emsmus 25/7, 2014 at 6:21

Thanks @mtall! Your suggested change speeds up the arma-function dramatically so that it beats R now, even though the Rcpp version still outperforms arma here. Thanks in any case for your comment. I do appreciate your help. – Valarievalda 29/7, 2014 at 8:39

@Valarievalda - It's a trade-off between code readability, compactness, and speed. While you can always write low-level code that's optimised for a particular task at hand, it almost always comes at the expense of being more difficult to read and maintain (and hence you're more likely to create bugs). Often it's safer and easier to just use the facilities provided by a library, such as the .each_col() approach. – Emsmus 1/8, 2014 at 3:50

@Valarievalda C++ newb here. Where did you insert the X.each_col() %=y in your code above? I am trying to do the same thing, and do not want to modify X in place as the accepted answer does. – Kerrykersey 22/6, 2018 at 23:52

If you want to speed up your calculations you will have to be a little careful about not making copies. This usually means sacrificing readability. Here is a version which makes no copies and modifies matrix X inplace.

// [[Rcpp::export]]
NumericMatrix Rcpp_matvecprod_elwise(NumericMatrix & X, NumericVector & y){
  unsigned int ncol = X.ncol();
  unsigned int nrow = X.nrow();
  int counter = 0;
  for (unsigned int j=0; j<ncol; j++) {
    for (unsigned int i=0; i<nrow; i++)  {
      X[counter++] *= y[i];
    }
  }
  return X;
}

Here is what I get on my machine

 > library(microbenchmark)
 > microbenchmark(R=R_matvecprod_elwise(X, e), Arma=A_matvecprod_elwise(X, e),  Rcpp=Rcpp_matvecprod_elwise(X, e))

Unit: milliseconds
 expr       min        lq    median       uq      max neval
    R  8.262845  9.386214 10.542599 11.53498 12.77650   100
 Arma 18.852685 19.872929 22.782958 26.35522 83.93213   100
 Rcpp  6.391219  6.640780  6.940111  7.32773  7.72021   100

> all.equal(R_matvecprod_elwise(X, e), Rcpp_matvecprod_elwise(X, e))
[1] TRUE

Windhover answered 24/7, 2014 at 18:37 Comment(3)

This looks really good. It is the first code I have run to beat R! If you don't mind me asking, why does declaring all integers as unsigned help? This rules out negative values, but why the speed improvement? Also, I am not sure I understand the line X[counter++] *= y[i];. What does counter++ do here, and what happened to j? I would be highly indebted for a brief comment... – Valarievalda 24/7, 2014 at 22:20

I didn't think the unsigned would matter. I'd be surprised if the benchmarks changed significantly because of it. The speed improvements come because we don't make any copies of the vectors or matrices. The counter variable is to avoid having to compute out where the (i,j) index is in the matrix by exploiting how R lays out the elements in a matrix. – Windhover 24/7, 2014 at 22:39

Yes, it is the memory layout issue that makes it faster. This solution is the only one operating by columns through the matrix, and just looking up the correct scalar. Everything else has to assemble row vectors which is not as cheap. And counter++ is "simply" a good trick to traverse the matrix as one long vector, column after columns -- which gets the speed improvement. – Sadden 25/7, 2014 at 14:10

For starters, I'd write the Armadillo version (interface) as

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

using namespace Rcpp;
using namespace arma;

// [[Rcpp::export]]
arama::mat A_matvecprod_elwise(const arma::mat & X, const arma::vec & y){
  int k = X.n_cols ;
  arma::mat Y = repmat(y, 1, k) ;  // 
  arma::mat out = X % Y;  
  return out;
}

as you're doing an additional conversion in and out (though the wrap() gets added by the glue code). The const & is notional (as you learned via your last question, a SEXP is a pointer object that is lightweight to copy) but better style.

You didn't show your benchmark results so I can't comment on the effect of matrix size etc pp. I suspect you might get better answers on rcpp-devel than here. Your pick.

Edit: If you really want something cheap and fast, I would just do this:

// [[Rcpp::export]]
mat cheapHadamard(mat X, vec y) {
    // should row dim of X versus length of Y here
    for (unsigned int i=0; i<y.n_elem; i++) X.row(i) *= y(i);
    return X;
}

which allocates no new memory and will hence be faster, and probably be competitive with R.

Test output:

R> cheapHadamard(testmat, testvec)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    4   10   16
[3,]    9   18   27
R>

Sadden answered 24/7, 2014 at 12:25 Comment(2)

Thanks again Dirk for your input. I tried using your cheapHadamard-function, but it turns out to be the slowest of them all, which surprises me. Do you have any idea why it might be so slow? – Valarievalda 24/7, 2014 at 14:47

Maybe row-wise access across matrices is slow. Dunno. Now you get profile this -- or look at the (open source after all) R code that beats us all. It may just call a faster BLAS routine... – Sadden 24/7, 2014 at 15:0

My apologies for giving an essentially C answer to a C++ question, but as has been suggested the solution generally lies in the efficient BLAS implementation of things. Unfortunately, BLAS itself lacks a Hadamard multiply so you would have to implement your own.

Here is a pure Rcpp implementation that basically calls C code. If you want to make it proper C++, the worker function can be templated but for most applications using R that isn't a concern. Note that this also operates "in-place", which means that it modifies X without copying it.

// it may be necessary on your system to uncomment one of the following
//#define restrict __restrict__ // gcc/clang
//#define restrict __restrict   // MS Visual Studio
//#define restrict              // remove it completely

#include <Rcpp.h>
using namespace Rcpp;

#include <cstdlib>
using std::size_t;

void hadamardMultiplyMatrixByVectorInPlace(double* restrict x,
                                           size_t numRows, size_t numCols,
                                           const double* restrict y)
{
  if (numRows == 0 || numCols == 0) return;

  for (size_t col = 0; col < numCols; ++col) {
    double* restrict x_col = x + col * numRows;

    for (size_t row = 0; row < numRows; ++row) {
      x_col[row] *= y[row];
    }
  }
}

// [[Rcpp::export]]
NumericMatrix C_matvecprod_elwise_inplace(NumericMatrix& X,
                                          const NumericVector& y)
{
  // do some dimension checking here

  hadamardMultiplyMatrixByVectorInPlace(X.begin(), X.nrow(), X.ncol(),
                                        y.begin());

  return X;
}

Here is a version that makes a copy first. I don't know Rcpp well enough to do this natively and not incur a substantial performance hit. Creating and returning a NumericMatrix(numRows, numCols) on the stack causes the code to run about 30% slower.

#include <Rcpp.h>
using namespace Rcpp;

#include <cstdlib>
using std::size_t;

#include <R.h>
#include <Rdefines.h>

void hadamardMultiplyMatrixByVector(const double* restrict x,
                                    size_t numRows, size_t numCols,
                                    const double* restrict y,
                                    double* restrict z)
{
  if (numRows == 0 || numCols == 0) return;

  for (size_t col = 0; col < numCols; ++col) {
    const double* restrict x_col = x + col * numRows;
    double* restrict z_col = z + col * numRows;

    for (size_t row = 0; row < numRows; ++row) {
      z_col[row] = x_col[row] * y[row];
    }
  }
}

// [[Rcpp::export]]
SEXP C_matvecprod_elwise(const NumericMatrix& X, const NumericVector& y)
{
  size_t numRows = X.nrow();
  size_t numCols = X.ncol();

  // do some dimension checking here

  SEXP Z = PROTECT(Rf_allocVector(REALSXP, (int) (numRows * numCols)));
  SEXP dimsExpr = PROTECT(Rf_allocVector(INTSXP, 2));
  int* dims = INTEGER(dimsExpr);
  dims[0] = (int) numRows;
  dims[1] = (int) numCols;
  Rf_setAttrib(Z, R_DimSymbol, dimsExpr);

  hadamardMultiplyMatrixByVector(X.begin(), X.nrow(), X.ncol(), y.begin(), REAL(Z));

  UNPROTECT(2);

  return Z;
}

If you're curious about usage of restrict, it means that you as the programmer enter a contract with the compiler that different bits of memory do not overlap, allowing the compiler to make certain optimizations. The restrict keyword is part of C++11 (and C99), but many compilers added extensions to C++ for earlier standards.

Some R code to benchmark:

require(rbenchmark)

n <- 50000
k <- 50
X <- matrix(rnorm(n*k), nrow=n)
e <- rnorm(n)

R_matvecprod_elwise <- function(mat, vec) mat*vec

all.equal(R_matvecprod_elwise(X, e), C_matvecprod_elwise(X, e))
X_dup <- X + 0
all.equal(R_matvecprod_elwise(X, e), C_matvecprod_elwise_inplace(X_dup, e))

benchmark(R_matvecprod_elwise(X, e),
          C_matvecprod_elwise(X, e),
          C_matvecprod_elwise_inplace(X, e),
          columns = c("test", "replications", "elapsed", "relative"),
          order = "relative", replications = 1000)

And the results:

                               test replications elapsed relative
3 C_matvecprod_elwise_inplace(X, e)         1000   3.317    1.000
2         C_matvecprod_elwise(X, e)         1000   7.174    2.163
1         R_matvecprod_elwise(X, e)         1000  10.670    3.217

Finally, the in-place version may actually be faster, as the repeated multiplications into the same matrix can cause some overflow mayhem.

Edit:

Removed the loop unrolling, as it provided no benefit and was otherwise distracting.

Relationship answered 24/7, 2014 at 21:19 Comment(7)

Thanks a lot for your answer! This looks very impressive and while I have to admit I don't understand many things yet, I hope some day will. I will try to get it to run soon! – Valarievalda 24/7, 2014 at 22:27

When I benchmarked this along with the other solutions, it did not come up as faster. I use the very last approach of a wrapper calling the C function. – Sadden 25/7, 2014 at 11:12

Did you time the code with and without the explicit loop unrolling? Many compilers will perform unrolling by themselves, and the explicit unrolling makes the code harder to read. Barring that, you might investigate Duff's Device. – Stentorian 25/7, 2014 at 14:16

Personally I am not so interested in such less-readable solution at the high-level user facing code (though we do have some tucked away in side Rcpp Sugar). – Sadden 25/7, 2014 at 19:5

I made a package with 4 versions of the algorithm: regular, unrolled, in-place, and in-place + unrolled. At first, the in-place was always beating the in-place + unrolled, and the unrolled was about nearly as slow as R. Took me a while to figure out, but reversing the order of testing for the in-place versions is what did it. Implementing a pure C benchmark in which it is possible to time just the function calls consistently produces IP (2.6s) < uIP (2.7s) < u (4.2s) < vanilla (4.3s) < R (10.9s). The unrolling is overkill, but could also likely be done better by someone with more experience. – Relationship 25/7, 2014 at 20:25

Likely it is best done by the compiler. You might examine the compiler output for the non-unrolled version to see if any unrolling is being applied by the compiler. The compiler can be smarter about unrolling than anyone can using source code, since you have no control over cache line layout using source code, but the compiler has full control over this (and everything else about code generation). If your unrolled loop spans two cache lines, it will (at times) run slower than if it fits within a single cache line. – Stentorian 26/7, 2014 at 0:50

I wouldn't say the compiler can be smarter than "anyone". For loops with non-trivial contents, e.g. computing an average online, a hand-unrolled version is often 4-6 times faster. But yes, that really is for the folk who implement the libraries to worry about. – Relationship 26/7, 2014 at 1:59

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags