Most efficient way to loop through an Eigen matrix

I'm creating some functions to do things like a "separated sum" of negative and positive numbers, Kahan summation, pairwise summation, and other things where the order in which I take the elements from the matrix doesn't matter, for example:

template <typename T, int R, int C>
inline T sum(const Eigen::Matrix<T,R,C>& xs)
{
    T sumP(0);
    T sumN(0);
    for (size_t i = 0, nRows = xs.rows(), nCols = xs.cols(); i < nRows; ++i)
        for (size_t j = 0; j < nCols; ++j)
        {
            if (xs(i,j) > 0)
                sumP += xs(i,j);
            else if (xs(i,j) < 0) // ignore 0 elements: improvement for sparse matrices, I think
                sumN += xs(i,j);
        }
    return sumP + sumN;
}

Now, I would like to make this as efficient as possible, so my question is: would it be better to loop over the columns within each row, as above, or do the opposite, like the following:

for (size_t i = 0, nRows = xs.rows(), nCols = xs.cols(); i < nCols; ++i)
  for (size_t j = 0; j < nRows; ++j)

(I suppose this depends on the order in which the matrix elements are stored in memory, but I couldn't find that in Eigen's manual.)

Also, are there alternative approaches, like using iterators (do they exist in Eigen?), that might be slightly faster?

Perchloride answered 29/4, 2013 at 15:55 Comment(0)

Eigen allocates matrices in column-major (Fortran) order by default (documentation).

The fastest way to iterate over a matrix is in storage order. Doing it the wrong way round will increase the number of cache misses by a factor of cacheline/elemsize (probably 64/8 = 8), and if your matrix doesn't fit in L1 those misses will dominate, so read: it will increase your computation time by roughly that factor.

If your matrix fits into L1 cache this won't make a difference, but a good compiler should be able to vectorise the loop, and with AVX enabled (on a shiny new Core i7) that could give you a speedup of as much as 4x (256 bits / 64 bits).

Finally, don't expect any of Eigen's built-in functions to give you a speed-up (I don't think there are iterators anyhow, but I may be mistaken); they're just going to generate the same (very simple) code.

TL;DR: Swap your order of iteration; you need to vary the row index most quickly.
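Applied to the function in the question, the swapped loop looks roughly like this (a sketch only; sum_colmajor is just an illustrative name):

#include <Eigen/Dense>

// Same separated sum as in the question, but with the loops swapped so the
// inner loop walks down a column, matching Eigen's default column-major storage.
template <typename T, int R, int C>
inline T sum_colmajor(const Eigen::Matrix<T,R,C>& xs)
{
    T sumP(0);
    T sumN(0);
    for (size_t c = 0, nCols = xs.cols(), nRows = xs.rows(); c < nCols; ++c)
        for (size_t r = 0; r < nRows; ++r)
        {
            if (xs(r,c) > 0)
                sumP += xs(r,c);
            else if (xs(r,c) < 0)
                sumN += xs(r,c);
        }
    return sumP + sumN;
}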

Chainsmoke answered 29/4, 2013 at 19:22 Comment(3)
The page you linked uses PlainObjectBase::data() like this: for (int i = 0; i < Acolmajor.size(); i++). I didn't know this function existed; maybe it's as fast as the plain loops and I won't have to worry about whether the matrix is column- or row-major. I'll do some benchmarks to check this. Thanks for the answer and link! – Perchloride
Ah, I didn't know about that function either. That's even better! – Chainsmoke
Yeah! Take a look at my answer: it's way faster than the other two ways! – Perchloride

I did some benchmarks to check which way is quicker and got the following results (in seconds):

sum_kahan1  (column-major loops):            12
sum_kahan2  (row-major loops):               30
sum_kahan3  (linear loop over data()):        3
sum_kahan1t (column-major, with temporary):   6
sum_kahan2t (row-major, with temporary):     23
sum_kahan3t (data(), with temporary):         3

The first line iterates as suggested by @jleahy. The second iterates as in the code from my question (the reverse of @jleahy's order). The third iterates using PlainObjectBase::data(), like this: for (int i = 0; i < matrixObject.size(); i++). The last three repeat the same three variants, but with a temporary as suggested by @lucas92.

I also did the same tests, but replacing the else if (element < 0) branch with a plain else (no special treatment for sparse matrices), and got the following (in seconds):

sum_kahan1e  (column-major loops):            10
sum_kahan2e  (row-major loops):               27
sum_kahan3e  (linear loop over data()):        3
sum_kahan1te (column-major, with temporary):   6
sum_kahan2te (row-major, with temporary):     24
sum_kahan3te (data(), with temporary):         2

Running the tests again gave pretty similar results. I used g++ 4.7.3 with -O3. The code:

#include <ctime>
#include <iostream>
#include <Eigen/Dense>

using namespace std;

template <typename T, int R, int C>
inline T sum_kahan1(const Eigen::Matrix<T,R,C>& xs) {
    // Kahan sum, column-major loop order (storage order): inner loop over rows.
    if (xs.size() == 0) return 0;
    T sumP(0), sumN(0);
    T tP(0), tN(0);
    T cP(0), cN(0);
    T yP(0), yN(0);
    for (size_t i = 0, nRows = xs.rows(), nCols = xs.cols(); i < nCols; ++i)
        for (size_t j = 0; j < nRows; ++j)
        {
            if (xs(j,i) > 0)
            {
                yP = xs(j,i) - cP;
                tP = sumP + yP;
                cP = (tP - sumP) - yP;
                sumP = tP;
            }
            else if (xs(j,i) < 0)
            {
                yN = xs(j,i) - cN;
                tN = sumN + yN;
                cN = (tN - sumN) - yN;
                sumN = tN;
            }
        }
    return sumP + sumN;
}

template <typename T, int R, int C>
inline T sum_kahan2(const Eigen::Matrix<T,R,C>& xs) {
    // Kahan sum, row-major loop order (as in the question): inner loop over columns.
    if (xs.size() == 0) return 0;
    T sumP(0), sumN(0);
    T tP(0), tN(0);
    T cP(0), cN(0);
    T yP(0), yN(0);
    for (size_t i = 0, nRows = xs.rows(), nCols = xs.cols(); i < nRows; ++i)
        for (size_t j = 0; j < nCols; ++j)
        {
            if (xs(i,j) > 0)
            {
                yP = xs(i,j) - cP;
                tP = sumP + yP;
                cP = (tP - sumP) - yP;
                sumP = tP;
            }
            else if (xs(i,j) < 0)
            {
                yN = xs(i,j) - cN;
                tN = sumN + yN;
                cN = (tN - sumN) - yN;
                sumN = tN;
            }
        }
    return sumP + sumN;
}


template <typename T, int R, int C>
inline T sum_kahan3(const Eigen::Matrix<T,R,C>& xs) {
    // Kahan sum, linear loop over the raw buffer via PlainObjectBase::data().
    if (xs.size() == 0) return 0;
    T sumP(0), sumN(0);
    T tP(0), tN(0);
    T cP(0), cN(0);
    T yP(0), yN(0);
    for (size_t i = 0, size = xs.size(); i < size; i++)
    {
        if ((*(xs.data() + i)) > 0)
        {
            yP = (*(xs.data() + i)) - cP;
            tP = sumP + yP;
            cP = (tP - sumP) - yP;
            sumP = tP;
        }
        else if ((*(xs.data() + i)) < 0)
        {
            yN = (*(xs.data() + i)) - cN;
            tN = sumN + yN;
            cN = (tN - sumN) - yN;
            sumN = tN;
        }
    }
    return sumP + sumN;
}

template <typename T, int R, int C>
inline T sum_kahan1t(const Eigen::Matrix<T,R,C>& xs) {
    // Kahan sum, column-major loop order, element read into a temporary.
    if (xs.size() == 0) return 0;
    T sumP(0), sumN(0);
    T tP(0), tN(0);
    T cP(0), cN(0);
    T yP(0), yN(0);
    for (size_t i = 0, nRows = xs.rows(), nCols = xs.cols(); i < nCols; ++i)
        for (size_t j = 0; j < nRows; ++j)
        {
            T temporary = xs(j,i);
            if (temporary > 0)
            {
                yP = temporary - cP;
                tP = sumP + yP;
                cP = (tP - sumP) - yP;
                sumP = tP;
            }
            else if (temporary < 0)
            {
                yN = temporary - cN;
                tN = sumN + yN;
                cN = (tN - sumN) - yN;
                sumN = tN;
            }
        }
    return sumP + sumN;
}

template <typename T, int R, int C>
inline T sum_kahan2t(const Eigen::Matrix<T,R,C>& xs) {
    // Kahan sum, row-major loop order, element read into a temporary.
    if (xs.size() == 0) return 0;
    T sumP(0), sumN(0);
    T tP(0), tN(0);
    T cP(0), cN(0);
    T yP(0), yN(0);
    for (size_t i = 0, nRows = xs.rows(), nCols = xs.cols(); i < nRows; ++i)
        for (size_t j = 0; j < nCols; ++j)
        {
            T temporary = xs(i,j);
            if (temporary > 0)
            {
                yP = temporary - cP;
                tP = sumP + yP;
                cP = (tP - sumP) - yP;
                sumP = tP;
            }
            else if (temporary < 0)
            {
                yN = temporary - cN;
                tN = sumN + yN;
                cN = (tN - sumN) - yN;
                sumN = tN;
            }
        }
    return sumP + sumN;
}


template <typename T, int R, int C>
inline T sum_kahan3t(const Eigen::Matrix<T,R,C>& xs) {
    // Kahan sum, linear loop over data(), element read into a temporary.
    if (xs.size() == 0) return 0;
    T sumP(0), sumN(0);
    T tP(0), tN(0);
    T cP(0), cN(0);
    T yP(0), yN(0);
    for (size_t i = 0, size = xs.size(); i < size; i++)
    {
        T temporary = (*(xs.data() + i));
        if (temporary > 0)
        {
            yP = temporary - cP;
            tP = sumP + yP;
            cP = (tP - sumP) - yP;
            sumP = tP;
        }
        else if (temporary < 0)
        {
            yN = temporary - cN;
            tN = sumN + yN;
            cN = (tN - sumN) - yN;
            sumN = tN;
        }
    }
    return sumP + sumN;
}

template <typename T, int R, int C>
inline T sum_kahan1e(const Eigen::Matrix<T,R,C>& xs) {
    // Kahan sum, column-major loop order, plain else (no special treatment for zeros).
    if (xs.size() == 0) return 0;
    T sumP(0), sumN(0);
    T tP(0), tN(0);
    T cP(0), cN(0);
    T yP(0), yN(0);
    for (size_t i = 0, nRows = xs.rows(), nCols = xs.cols(); i < nCols; ++i)
        for (size_t j = 0; j < nRows; ++j)
        {
            if (xs(j,i) > 0)
            {
                yP = xs(j,i) - cP;
                tP = sumP + yP;
                cP = (tP - sumP) - yP;
                sumP = tP;
            }
            else
            {
                yN = xs(j,i) - cN;
                tN = sumN + yN;
                cN = (tN - sumN) - yN;
                sumN = tN;
            }
        }
    return sumP + sumN;
}

template <typename T, int R, int C>
inline T sum_kahan2e(const Eigen::Matrix<T,R,C>& xs) {
    // Kahan sum, row-major loop order, plain else (no special treatment for zeros).
    if (xs.size() == 0) return 0;
    T sumP(0), sumN(0);
    T tP(0), tN(0);
    T cP(0), cN(0);
    T yP(0), yN(0);
    for (size_t i = 0, nRows = xs.rows(), nCols = xs.cols(); i < nRows; ++i)
        for (size_t j = 0; j < nCols; ++j)
        {
            if (xs(i,j) > 0)
            {
                yP = xs(i,j) - cP;
                tP = sumP + yP;
                cP = (tP - sumP) - yP;
                sumP = tP;
            }
            else
            {
                yN = xs(i,j) - cN;
                tN = sumN + yN;
                cN = (tN - sumN) - yN;
                sumN = tN;
            }
        }
    return sumP + sumN;
}


template <typename T, int R, int C>
inline T sum_kahan3e(const Eigen::Matrix<T,R,C>& xs) {
    // Kahan sum, linear loop over data(), plain else (no special treatment for zeros).
    if (xs.size() == 0) return 0;
    T sumP(0), sumN(0);
    T tP(0), tN(0);
    T cP(0), cN(0);
    T yP(0), yN(0);
    for (size_t i = 0, size = xs.size(); i < size; i++)
    {
        if ((*(xs.data() + i)) > 0)
        {
            yP = (*(xs.data() + i)) - cP;
            tP = sumP + yP;
            cP = (tP - sumP) - yP;
            sumP = tP;
        }
        else
        {
            yN = (*(xs.data() + i)) - cN;
            tN = sumN + yN;
            cN = (tN - sumN) - yN;
            sumN = tN;
        }
    }
    return sumP + sumN;
}

template <typename T, int R, int C>
inline T sum_kahan1te(const Eigen::Matrix<T,R,C>& xs) {
    // Kahan sum, column-major loop order, temporary, plain else.
    if (xs.size() == 0) return 0;
    T sumP(0), sumN(0);
    T tP(0), tN(0);
    T cP(0), cN(0);
    T yP(0), yN(0);
    for (size_t i = 0, nRows = xs.rows(), nCols = xs.cols(); i < nCols; ++i)
        for (size_t j = 0; j < nRows; ++j)
        {
            T temporary = xs(j,i);
            if (temporary > 0)
            {
                yP = temporary - cP;
                tP = sumP + yP;
                cP = (tP - sumP) - yP;
                sumP = tP;
            }
            else
            {
                yN = temporary - cN;
                tN = sumN + yN;
                cN = (tN - sumN) - yN;
                sumN = tN;
            }
        }
    return sumP + sumN;
}

template <typename T, int R, int C>
inline T sum_kahan2te(const Eigen::Matrix<T,R,C>& xs) {
    // Kahan sum, row-major loop order, temporary, plain else.
    if (xs.size() == 0) return 0;
    T sumP(0), sumN(0);
    T tP(0), tN(0);
    T cP(0), cN(0);
    T yP(0), yN(0);
    for (size_t i = 0, nRows = xs.rows(), nCols = xs.cols(); i < nRows; ++i)
        for (size_t j = 0; j < nCols; ++j)
        {
            T temporary = xs(i,j);
            if (temporary > 0)
            {
                yP = temporary - cP;
                tP = sumP + yP;
                cP = (tP - sumP) - yP;
                sumP = tP;
            }
            else
            {
                yN = temporary - cN;
                tN = sumN + yN;
                cN = (tN - sumN) - yN;
                sumN = tN;
            }
        }
    return sumP + sumN;
}


template <typename T, int R, int C>
inline T sum_kahan3te(const Eigen::Matrix<T,R,C>& xs) {
    // Kahan sum, linear loop over data(), temporary, plain else.
    if (xs.size() == 0) return 0;
    T sumP(0), sumN(0);
    T tP(0), tN(0);
    T cP(0), cN(0);
    T yP(0), yN(0);
    for (size_t i = 0, size = xs.size(); i < size; i++)
    {
        T temporary = (*(xs.data() + i));
        if (temporary > 0)
        {
            yP = temporary - cP;
            tP = sumP + yP;
            cP = (tP - sumP) - yP;
            sumP = tP;
        }
        else
        {
            yN = temporary - cN;
            tN = sumN + yN;
            cN = (tN - sumN) - yN;
            sumN = tN;
        }
    }
    return sumP + sumN;
}


int main() {

    Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic> test = Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic>::Random(10000,10000);

    cout << "start" << endl;   
    time_t now;

    now = time(0);
    sum_kahan1(test);
    cout << time(0) - now << endl;

    now = time(0);
    sum_kahan2(test);
    cout << time(0) - now << endl;

    now = time(0);
    sum_kahan3(test);
    cout << time(0) - now << endl;

    now = time(0);
    sum_kahan1t(test);
    cout << time(0) - now << endl;

    now = time(0);
    sum_kahan2t(test);
    cout << time(0) - now << endl;

    now = time(0);
    sum_kahan3t(test);
    cout << time(0) - now << endl;

    now = time(0);
    sum_kahan1e(test);
    cout << time(0) - now << endl;

    now = time(0);
    sum_kahan2e(test);
    cout << time(0) - now << endl;

    now = time(0);
    sum_kahan3e(test);
    cout << time(0) - now << endl;

    now = time(0);
    sum_kahan1te(test);
    cout << time(0) - now << endl;

    now = time(0);
    sum_kahan2te(test);
    cout << time(0) - now << endl;

    now = time(0);
    sum_kahan3te(test);
    cout << time(0) - now << endl;

    return 0;
}
Perchloride answered 29/4, 2013 at 20:50 Comment(4)
Benchmarks! Beautiful. You'd be surprised how many people refuse to benchmark their code. I'm surprised that using .data() is so much faster; that could be an indication that Eigen is doing some work in the .cols(), .rows() and xs(i,j) calls. You could try hoisting xs.size() and xs.data() out of the loop; that might help even more if that's the case (although it's unlikely, it might be worth a shot). – Chainsmoke
xs.size() is already out of the loop. I was having trouble hoisting xs.data() out as T * xsBegin = xs.data(); because g++ was outputting those crazy big template errors; it turns out I was just missing a const: const T * xsBegin = xs.data();. – Perchloride
AFAIK for (size_t i = 0, size = xs.size(); i < size; i++) { } is similar to { size_t i = 0, size = xs.size(); for (; i < size; i++) { } }. – Perchloride
It is; my apologies, I wasn't looking at your code properly and didn't notice you'd already made that optimisation. Previous comment retracted so as not to confuse people. – Chainsmoke

I notice that the code is equivalent to the sum of all the entries in the matrix, i.e., you could just do this:

return xs.sum();

I would assume that would perform better since it's only a single pass, and furthermore Eigen ought to "know" how to arrange the passes for optimum performance.

If, however, you want to retain the two passes, you could express this using the coefficient-wise reduction mechanisms, like this:

return (xs.array() > 0).select(xs, 0).sum() +
       (xs.array() < 0).select(xs, 0).sum();

which uses the boolean coefficient-wise select to pick the positive and negative entries. I don't know whether it would outperform the hand-rolled loops, but in theory coding it this way allows Eigen (and the compiler) to know more about your intention, and possibly improve the results.
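For what it's worth, a similar one-liner can be written with coefficient-wise max/min instead of select (a sketch, not from the original answer; it assumes a floating-point scalar type such as double):

// max(0) keeps the positive entries (others become 0), min(0) keeps the negative ones.
return xs.array().max(0.0).sum() + xs.array().min(0.0).sum();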

Daren answered 20/8, 2015 at 1:10 Comment(0)

Try storing xs(i,j) in a temporary variable inside the loop so you only call the function once.
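For example, the sum from the question could be written like this (a rough sketch; sum_with_temp is just an illustrative name):

#include <Eigen/Dense>

// Same separated sum, but the element is read into a temporary once per
// iteration instead of being fetched for each comparison and accumulation.
template <typename T, int R, int C>
inline T sum_with_temp(const Eigen::Matrix<T,R,C>& xs)
{
    T sumP(0);
    T sumN(0);
    for (size_t i = 0, nRows = xs.rows(), nCols = xs.cols(); i < nRows; ++i)
        for (size_t j = 0; j < nCols; ++j)
        {
            const T x = xs(i,j);   // single element access per iteration
            if (x > 0)
                sumP += x;
            else if (x < 0)
                sumN += x;
        }
    return sumP + sumN;
}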

Sarnen answered 29/4, 2013 at 17:15 Comment(2)
It's inside an if / else if, so you only ever call it once. – Chainsmoke
I call it two or three times per iteration. I'll do some benchmarks to check if it's faster to create a temporary. – Perchloride

Unless you're on a processor that lacks hardware support for adding two T's, I would highly recommend removing the if check altogether and simply keeping a single running sum (and retesting the performance that way). I.e.:

   for( ... )
      sum += xs(i,j);

Considering you already need to know the value of the cell in order to evaluate the if condition (i.e. you've already paid the price for the memory fetch), there's little benefit to avoiding the addition of 0 for cells that are zero-valued.
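Applied to the fastest variants from the benchmarks above (linear traversal over data()), a branch-free version that keeps the Kahan compensation might look like this (a sketch; sum_kahan_nobranch is just an illustrative name):

#include <Eigen/Dense>

// One running Kahan accumulator and no sign test: zero-valued cells simply add 0.
template <typename T, int R, int C>
inline T sum_kahan_nobranch(const Eigen::Matrix<T,R,C>& xs)
{
    T sum(0), c(0);
    const T* p = xs.data();   // contiguous storage for a dense Eigen::Matrix
    for (size_t i = 0, n = xs.size(); i < n; ++i)
    {
        T y = p[i] - c;
        T t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }
    return sum;
}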

In the benchmark tests you've posted as an answer, you set T to double. Unless you are working with a CPU that lacks hardware floating point support, your CPU would be very likely to execute the double addition operation much faster than it can deal with the conditional branching. In fact, running your code on an Intel Core i9, I was able to achieve more than 10 times speed-up (up to 16 times depending on matrix size) compared to your fastest version simply by removing the if check and always adding the cell's value to the running sum. I realize this is an old question, but the situation should have been no different on CPUs that were around in 2013 either, unless you were dealing with a specialized/embedded CPU. Obviously one should always test on the hardware they're targeting.

PS. This answer assumes you've already taken @jleahy's answer above into account.

Mercator answered 27/10, 2022 at 19:44 Comment(0)
