Is Eigen slow at multiplying small matrices?

I wrote a function that multiplies 10x10 Eigen matrices together. Then I wrote a naive multiply function, CustomMultiply, which was surprisingly 2x faster than Eigen's implementation.

I tried a couple of different compilation flags like -O2 and -O3, which did not make a difference.

  #include <Eigen/Core>
  #include <vector>

  constexpr int dimension = 10;
  using Matrix = Eigen::Matrix<double, dimension, dimension>;

  // Naive column-major product: build each column of the result as a linear
  // combination of the columns of a, scaled by the corresponding entries of b.
  Matrix CustomMultiply(const Matrix& a, const Matrix& b) {
    Matrix result = Matrix::Zero();
    for (int bcol_idx = 0; bcol_idx < dimension; ++bcol_idx) {
      for (int brow_idx = 0; brow_idx < dimension; ++brow_idx) {
        result.col(bcol_idx).noalias() += a.col(brow_idx) * b(brow_idx, bcol_idx);
      }
    }
    return result;
  }

  Matrix PairwiseMultiplyEachMatrixNoAlias(int num_repetitions, const std::vector<Matrix>& input) {
    Matrix acc = Matrix::Zero();
    for (int i = 0; i < num_repetitions; ++i) {
      for (const auto& matrix_a : input) {
        for (const auto& matrix_b : input) {
          acc.noalias() += matrix_a * matrix_b;
        }
      }
    }
    return acc;
  }

  Matrix PairwiseMultiplyEachMatrixCustom(int num_repetitions, const std::vector<Matrix>& input) {
    Matrix acc = Matrix::Zero();
    for (int i = 0; i < num_repetitions; ++i) {
      for (const auto& matrix_a : input) {
        for (const auto& matrix_b : input) {
          acc.noalias() += CustomMultiply(matrix_a, matrix_b);
        }
      }
    }
    return acc;
  }

PairwiseMultiplyEachMatrixNoAlias is 2x slower than PairwiseMultiplyEachMatrixCustom on my machine when I pass in 100 random matrices as input and use 100 as num_repetitions. My machine details: Intel Xeon CPU E5-2630 v4, Ubuntu 16.04, Eigen 3.
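
The timing harness is not shown above; per the comments below it was along these lines (a sketch only; TimeIt and the reporting details are illustrative, not my exact code):

#include <chrono>
#include <iostream>

// Sketch of the std::chrono harness described in the comments: call
// steady_clock::now() before and after the function under test.
void TimeIt(int num_repetitions, const std::vector<Matrix>& input) {
  const auto start = std::chrono::steady_clock::now();
  const Matrix result = PairwiseMultiplyEachMatrixNoAlias(num_repetitions, input);
  const auto stop = std::chrono::steady_clock::now();
  // Print result(0, 0) so the computation cannot be optimized away.
  std::cout << std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count()
            << " us, result(0, 0) = " << result(0, 0) << '\n';
}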

Updates: after helpful discussion in the comments, the results are unchanged with the following modifications:

  • num_repetitions = 1 and input.size() = 1000
  • using .lazyProduct() and using .eval() actually leads to further slowdown (the exact one-line variants are sketched after this list)
  • clang 8.0.0
  • g++ 9.2
  • using flags -march=native -DNDEBUG
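
For concreteness, these are the inner-loop variants referred to above and in the comments; only the accumulation line changes (acc, matrix_a, matrix_b as in the code above):

acc.noalias() += matrix_a * matrix_b;                 // default Eigen product
acc.noalias() += matrix_a.lazyProduct(matrix_b);      // coefficient-based product, skips the GEMM kernel
acc.noalias() += (matrix_a * matrix_b).eval();        // force evaluation into a temporary
acc.noalias() += CustomMultiply(matrix_a, matrix_b);  // the naive version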

Updates 2:
Following up on @dtell's findings with the Google Benchmark library, I found an interesting result: multiplying 2 matrices with Eigen is faster than custom, but multiplying many matrices with Eigen is 2x slower, in line with the previous findings.

Here is my Google Benchmark code. (Note: there was an off-by-one error in the GetRandomMatrices() function below, which is now fixed.)

#include <Eigen/Core>
#include <Eigen/StdVector>
#include <benchmark/benchmark.h>

#include <cstdlib>

constexpr int dimension = 10;
constexpr int num_random_matrices = 10;
using Matrix = Eigen::Matrix<double, dimension, dimension>;
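// A std::vector of fixed-size vectorizable Eigen types needs aligned storage,
// hence Eigen::aligned_allocator below (see Eigen's docs on STL containers).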
using Eigen_std_vector = std::vector<Matrix,Eigen::aligned_allocator<Matrix>>;

Eigen_std_vector GetRandomMatrices(int num_matrices) {
  Eigen_std_vector matrices;
  for (int i = 0; i < num_matrices; ++i) {
    matrices.push_back(Matrix::Random());
  }
  return matrices;
}

Matrix CustomMultiply(const Matrix& a, const Matrix& b) {
  Matrix result = Matrix::Zero();
  for (int bcol_idx = 0; bcol_idx < dimension; ++bcol_idx) {
    for (int brow_idx = 0; brow_idx < dimension; ++brow_idx) {
      result.col(bcol_idx).noalias() += a.col(brow_idx) * b(brow_idx, bcol_idx);
    }
  }
  return result;
}

Matrix PairwiseMultiplyEachMatrixNoAlias(int num_repetitions, const Eigen_std_vector& input) {
  Matrix acc = Matrix::Zero();
  for (int i = 0; i < num_repetitions; ++i) {
    for (const auto& matrix_a : input) {
      for (const auto& matrix_b : input) {
        acc.noalias() += matrix_a * matrix_b;
      }
    }
  }
  return acc;
}

Matrix PairwiseMultiplyEachMatrixCustom(int num_repetitions, const Eigen_std_vector& input) {
  Matrix acc = Matrix::Zero();
  for (int i = 0; i < num_repetitions; ++i) {
    for (const auto& matrix_a : input) {
      for (const auto& matrix_b : input) {
        acc.noalias() += CustomMultiply(matrix_a, matrix_b);
      }
    }
  }
  return acc;
}

void BM_PairwiseMultiplyEachMatrixNoAlias(benchmark::State& state) {
  // Perform setup here
  const auto random_matrices = GetRandomMatrices(num_random_matrices);
  for (auto _ : state) {
    benchmark::DoNotOptimize(PairwiseMultiplyEachMatrixNoAlias(1, random_matrices));
  }
}
BENCHMARK(BM_PairwiseMultiplyEachMatrixNoAlias);


void BM_PairwiseMultiplyEachMatrixCustom(benchmark::State& state) {
  // Perform setup here
  const auto random_matrices = GetRandomMatrices(num_random_matrices);
  for (auto _ : state) {
    benchmark::DoNotOptimize(PairwiseMultiplyEachMatrixCustom(1, random_matrices));
  }
}
BENCHMARK(BM_PairwiseMultiplyEachMatrixCustom);

void BM_MultiplySingle(benchmark::State& state) {
  // Perform setup here
  const auto random_matrices = GetRandomMatrices(2);
  for (auto _ : state) {
    benchmark::DoNotOptimize((random_matrices[0] * random_matrices[1]).eval());
  }
}
BENCHMARK(BM_MultiplySingle);

void BM_MultiplySingleCustom(benchmark::State& state) {
  // Perform setup here
  const auto random_matrices = GetRandomMatrices(2);
  for (auto _ : state) {
    benchmark::DoNotOptimize(CustomMultiply(random_matrices[0], random_matrices[1]));
  }
}
BENCHMARK(BM_MultiplySingleCustom);



double TestCustom() {
  const Matrix a = Matrix::Random();
  const Matrix b = Matrix::Random();

  const Matrix c = a * b;
  const Matrix custom_c = CustomMultiply(a, b);

  const double err = (c - custom_c).squaredNorm();
  return err;
}

// Just sanity-check the multiplication. Note this never enters the timing
// loop, which is why it reports 0 ns / 0 iterations in the output below.
void BM_TestCustom(benchmark::State& state) {
  if (TestCustom() > 1e-10) {
    exit(-1);
  }
}
BENCHMARK(BM_TestCustom);

BENCHMARK_MAIN();

This yields the following mysterious report

Run on (20 X 3100 MHz CPU s)
CPU Caches:
  L1 Data 32K (x10)
  L1 Instruction 32K (x10)
  L2 Unified 256K (x10)
  L3 Unified 25600K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
----------------------------------------------------------------------------
Benchmark                                     Time           CPU Iterations
----------------------------------------------------------------------------
BM_PairwiseMultiplyEachMatrixNoAlias      28283 ns      28285 ns      20250
BM_PairwiseMultiplyEachMatrixCustom       14442 ns      14443 ns      48488
BM_MultiplySingle                           791 ns        791 ns     876969
BM_MultiplySingleCustom                     874 ns        874 ns     802052
BM_TestCustom                                 0 ns          0 ns          0

My current hypothesis is that the slowdown is attributable to instruction cache misses. It's possible Eigen's matrix multiply function does bad things to the instruction cache.

VTune output for custom: [VTune microarchitecture analysis screenshot]

VTune output for Eigen: [VTune microarchitecture analysis screenshot]

Maybe someone with more experience with VTune can tell me if I am interpreting this result correctly. The DSB is the decoded uop cache, and MITE is the legacy decode pipeline, so it relates to instruction decoder bandwidth. The Eigen version shows that most uops miss the DSB (66% miss rate) and a marked increase in stalling due to MITE bandwidth.

Update 3: After getting reports that the single-multiply custom version was faster, I reproduced that on my machine as well. This goes against @dtell's original findings on their machine.

CPU Caches:
  L1 Data 32K (x10)
  L1 Instruction 32K (x10)
  L2 Unified 256K (x10)
  L3 Unified 25600K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
----------------------------------------------------------------------------
Benchmark                                     Time           CPU Iterations
----------------------------------------------------------------------------
BM_PairwiseMultiplyEachMatrixNoAlias      34787 ns      34789 ns      16477
BM_PairwiseMultiplyEachMatrixCustom       17901 ns      17902 ns      37759
BM_MultiplySingle                           349 ns        349 ns    2054295
BM_MultiplySingleCustom                     178 ns        178 ns    4624183
BM_TestCustom                                 0 ns          0 ns          0

I wonder if I left an optimization flag out of my previous benchmark run. In any case, I think it is confirmed that Eigen incurs an overhead when multiplying small matrices. If anyone out there has a machine that does not use a uop cache, I would be interested in seeing whether the slowdown is less severe.
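
As an aside, one way to check which flags actually took effect is to print the SIMD instruction sets Eigen was compiled with; a minimal sketch (a separate program, not part of the benchmark above):

#include <Eigen/Core>
#include <iostream>

int main() {
  // Prints the instruction sets Eigen is using, e.g. "SSE, SSE2" by default
  // vs. a longer list including AVX when built with -march=native.
  std::cout << Eigen::SimdInstructionSetsInUse() << std::endl;
}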

Nightly answered 23/9, 2019 at 23:39 Comment(19)
Why are you profiling multiplying the same matrices 100 times in a row? I mean, the correct optimization would be 100 times the value of doing it once, which isn't what you want to profile. Use 1000 matrices instead of 100 as a first step.Whopper
Exactly the same result here with g++ 9.2 and -O2.Disc
@Yakk-AdamNevraumont I didn't get your comment. What optimization?Disc
@benj Doing X 100 times is the same as doing X once and then multiplying by 100. When doing a benchmark, ensure to the best of your ability that optimizations cannot reach behaviour you don't want to measure.Whopper
For comparison, can you also try with -march=native -O2 -DNDEBUG and can you try matrix_a.lazyProduct(matrix_b) instead of matrix_a * matrix_b? It appears to me that the threshold for switching to the cache-optimized GEMM is set too low.Billingsley
@Yakk-AdamNevraumont Since the compiler would obviously do that in both methods, it would "only" degrade measurement precision. It does not explain the huge difference.Disc
@BenjaminBihler No, that is not "obvious". In fact, compilers can quite often work out that one piece of code is pure and has no side effects but not another, for reasons that do not apply to the more realistic code you actually want profiling information for. I have no idea if that is happening here, but I'm saying "eliminate the extra loop".Whopper
@Yakk-AdamNevraumont If I understand correctly, you are saying that some compiler could theoretically optimize away the outer loop because the input is marked const. For that reason, I have repeated the benchmark with num_repetitions = 1 and input.size() = 1000 and found the same effect.Nightly
@MarkLiu Sure, that is an improvement! It would be interesting to know if the .noalias() += on col/rows avoids allocation in both cases. In theory the matrix += a*b could be done with no actual allocation of a matrix, while matrix += custommult(a,b) cannot as written.Whopper
@Billingsley Using your suggested flags did not make a difference. Using lazyProduct sped up the computation, but not as much as the naive version - only 1.5x slower than naive.Nightly
Aha, here: eigen.tuxfamily.org/dox/TopicLazyEvaluation.html -- your .noalias() might block evaluate-before-assigning flag on the right (unless the cost model says do it anyhow)? Then if the cost model is off on your hardware, it might spend more time than needed? Try a (matrix_a * matrix_b).eval(); that should make the * code run for certain, which is what your naive version does.Whopper
@Yakk-AdamNevraumont Thanks for the suggestion, but using .eval() did not help, and neither did the raw expression acc += matrix_a * matrix_b;. noalias() leads to a consistent speedup on my machine.Nightly
Thanks for testing it @BenjaminBihler, that's useful to know, since I'm on clang.Nightly
Playing around with the dimension (you forgot a type there, by the way), it seems that Eigen's performance starts to be worse than the custom GEMM at dim = 8. Since it is a lot faster than the custom one at n = 4, for example, I think it's switching the internal algorithm.Serpentiform
@Serpentiform Thanks, I corrected the code sample. On my machine, the custom version is faster for dimensions 5 to 12. Then it's unclear until about dimension 20, at which point Eigen becomes faster.Nightly
I can reproduce your results, but the difference is a little bit smaller: 26989ns vs 17616ns. Perhaps Eigen uses a different algorithm, and yours happens to be faster. For these small matrices, naive algorithms work well (and since even the naive version uses Eigen's functions, it uses SIMD, so your naive algorithm is not a bad implementation for small matrices).Botts
@Botts The mystery is that Eigen's implementation works for a single multiply, but becomes worse when you do it many timesNightly
For me, custom is faster for single as well: 544ns vs 390ns. Consistently, for both clang and gcc. The difference between them varies with the compiler, but custom is consistently faster. Btw, it is hard to analyze this: if there is no trivial clue written out by VTune or perf, it may take too much effort to figure out what's going on (as the compiled asm code is huge).Botts
@Botts Thanks for your report, I have reproduced it on my machine now as well. VTune points to a huge difference in DSB coverage between the versions, and also reports 14.5% of pipeline slots being unfilled due to front-end issues. Do you think the huge asm code supports the hypothesis?Nightly
(gdb) bt
#0  0x00005555555679e3 in Eigen::internal::gemm_pack_rhs<double, long, Eigen::internal::const_blas_data_mapper<double, long, 0>, 4, 0, false, false>::operator()(double*, Eigen::internal::const_blas_data_mapper<double, long, 0> const&, long, long, long, long) ()
#1  0x0000555555566654 in Eigen::internal::general_matrix_matrix_product<long, double, 0, false, double, 0, false, 0>::run(long, long, long, double const*, long, double const*, long, double*, long, double, Eigen::internal::level3_blocking<double, double>&, Eigen::internal::GemmParallelInfo<long>*) ()
#2  0x0000555555565822 in BM_PairwiseMultiplyEachMatrixNoAlias(benchmark::State&) ()
#3  0x000055555556d571 in benchmark::internal::(anonymous namespace)::RunInThread(benchmark::internal::Benchmark::Instance const*, unsigned long, int, benchmark::internal::ThreadManager*) ()
#4  0x000055555556b469 in benchmark::RunSpecifiedBenchmarks(benchmark::BenchmarkReporter*, benchmark::BenchmarkReporter*) ()
#5  0x000055555556a450 in main ()

From the stack trace, Eigen's matrix multiplication is using a generic multiply method that loops over a dynamic matrix size. For the custom implementation, clang aggressively vectorizes it and unrolls the loop, so there is much less branching.

Maybe there is some flag or option for Eigen to generate code specialized for this particular size.
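
One way to keep Eigen off the blocked GEMM path is lazyProduct, which always uses the coefficient-based product (this was also suggested in the comments on the question; per the asker it helped, but did not fully close the gap):

// Coefficient-based product: never dispatches to the blocked GEMM kernel,
// so gemm_pack_rhs no longer appears in the profile.
acc.noalias() += matrix_a.lazyProduct(matrix_b);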

However, if the matrix size is bigger, the Eigen version performs much better than the custom one.

Agitation answered 25/9, 2019 at 21:53 Comment(0)

I've rewritten your code using a proper benchmark library, namely Google Benchmark, and cannot reproduce your measurements.

My results for -O0, where the second template parameter is the matrix dimension:

Running ./benchmark
Run on (12 X 2900 MHz CPU s)
CPU Caches:
  L1 Data 32K (x6)
  L1 Instruction 32K (x6)
  L2 Unified 262K (x6)
  L3 Unified 12582K (x1)
---------------------------------------------------------------------
Benchmark                              Time           CPU Iterations
---------------------------------------------------------------------
BM_CustomMultiply<double, 3>        5391 ns       5389 ns     105066
BM_CustomMultiply<double, 4>        9365 ns       9364 ns      73649
BM_CustomMultiply<double, 5>       15349 ns      15349 ns      44008
BM_CustomMultiply<double, 6>       20953 ns      20947 ns      32230
BM_CustomMultiply<double, 7>       33328 ns      33318 ns      21584
BM_CustomMultiply<double, 8>       44237 ns      44230 ns      15500
BM_CustomMultiply<double, 9>       57142 ns      57140 ns      11953
BM_CustomMultiply<double, 10>      69382 ns      69382 ns       9998
BM_EigenMultiply<double, 3>         2335 ns       2335 ns     295458
BM_EigenMultiply<double, 4>         1613 ns       1613 ns     457382
BM_EigenMultiply<double, 5>         4791 ns       4791 ns     142992
BM_EigenMultiply<double, 6>         3471 ns       3469 ns     206002
BM_EigenMultiply<double, 7>         9052 ns       9051 ns      78135
BM_EigenMultiply<double, 8>         8655 ns       8655 ns      81717
BM_EigenMultiply<double, 9>        11446 ns      11399 ns      67001
BM_EigenMultiply<double, 10>       15092 ns      15053 ns      46924

As you can see, the number of iterations Google Benchmark uses is orders of magnitude higher than in your benchmark. Micro-benchmarking is extremely hard, especially when you deal with execution times of a few hundred nanoseconds.
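
The benchmark registration itself is not shown; judging from the names in the output and the comment below about generating a fresh random matrix per iteration, it presumably looks something like this (a sketch under those assumptions, not the exact code):

template <typename Scalar, int N>
void BM_EigenMultiply(benchmark::State& state) {
  using Mat = Eigen::Matrix<Scalar, N, N>;
  for (auto _ : state) {
    // Fresh random operands every iteration, rather than a pre-built
    // vector of matrices as in the question's benchmark.
    const Mat a = Mat::Random();
    const Mat b = Mat::Random();
    benchmark::DoNotOptimize((a * b).eval());
  }
}
BENCHMARK_TEMPLATE(BM_EigenMultiply, double, 3);
BENCHMARK_TEMPLATE(BM_EigenMultiply, double, 10);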

To be fair, calling your custom function involves a copy, and manually inlining it saves a few nanoseconds, but it still does not beat Eigen.

Measurement with manually inlined CustomMultiply and -O2 -DNDEBUG -march=native:

Running ./benchmark
Run on (12 X 2900 MHz CPU s)
CPU Caches:
  L1 Data 32K (x6)
  L1 Instruction 32K (x6)
  L2 Unified 262K (x6)
  L3 Unified 12582K (x1)
---------------------------------------------------------------------
Benchmark                              Time           CPU Iterations
---------------------------------------------------------------------
BM_CustomMultiply<double, 3>          51 ns         51 ns   11108114
BM_CustomMultiply<double, 4>          88 ns         88 ns    7683611
BM_CustomMultiply<double, 5>         147 ns        147 ns    4642341
BM_CustomMultiply<double, 6>         213 ns        213 ns    3205627
BM_CustomMultiply<double, 7>         308 ns        308 ns    2246391
BM_CustomMultiply<double, 8>         365 ns        365 ns    1904860
BM_CustomMultiply<double, 9>         556 ns        556 ns    1254953
BM_CustomMultiply<double, 10>        661 ns        661 ns    1027825
BM_EigenMultiply<double, 3>           39 ns         39 ns   17918807
BM_EigenMultiply<double, 4>           69 ns         69 ns    9931755
BM_EigenMultiply<double, 5>          119 ns        119 ns    5801185
BM_EigenMultiply<double, 6>          178 ns        178 ns    3838772
BM_EigenMultiply<double, 7>          256 ns        256 ns    2692898
BM_EigenMultiply<double, 8>          385 ns        385 ns    1826598
BM_EigenMultiply<double, 9>          546 ns        546 ns    1271687
BM_EigenMultiply<double, 10>         644 ns        644 ns    1104798
Edgeworth answered 24/9, 2019 at 20:53 Comment(5)
Thank you @dtell! Can you post the benchmark code? I would like to try it on my machine and also understand how it is getting more accurate results. On my computer I'm using the std::chrono::steady_clockNightly
In my code I'm calling std::chrono::steady_clock::now() before and after calling my functions, and then reporting the results.Nightly
I have updated my question with a Google Benchmark code sample. Could you let me know if there are problems with it? I am finding the same resultsNightly
Oh that's quite interesting. A major difference (that I thought wouldn't be major) is that I generate a random matrix in each iteration rather than passing in a vector of random matrices. I'll post my code tonight, when I'm home.Edgeworth
Thanks - I have rerun my benchmark and now even the single multiply custom version is faster. What kind of CPU are you using?Nightly

Because Eigen packs the operands of the product even for small matrices, as long as the product does not fall back to gemv. This packing step introduces extra overhead.
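
For intuition, "packing" means copying an operand into a contiguous blocked buffer before the GEMM kernel runs; conceptually something like the sketch below (illustrative only, not Eigen's actual gemm_pack_rhs). For a 10x10 product this copy is pure overhead compared to the naive loop.

// Illustrative sketch: copy a column-major 10x10 rhs into a packed buffer.
void PackRhs(const Matrix& b, double* packed) {
  int k = 0;
  for (int col = 0; col < dimension; ++col)
    for (int row = 0; row < dimension; ++row)
      packed[k++] = b(row, col);
}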

Abdella answered 30/5, 2023 at 4:2 Comment(2)
Could you add more information? This answer is currently unclear.Motile
@user16217248 You can also see it in the first answer: the gdb backtrace stops at a frame in Eigen::internal::gemm_pack_rhs. This packing behavior causes the overhead.Abdella
