How can the C++ Eigen library perform better than specialized vendor libraries?

Asked 28/4, 2012 at 17:52 Answered 27/9, 2016 at 6:51

I was looking over the performance benchmarks: http://eigen.tuxfamily.org/index.php?title=Benchmark

I could not help but notice that eigen appears to consistently outperform all the specialized vendor libraries. The questions is: how is it possible? One would assume that mkl/goto would use processor specific tuned code, while eigen is rather generic.

Notice this http://download.tuxfamily.org/eigen/btl-results-110323/aat.pdf, essentially a dgemm. For N=1000 Eigen gets roughly 17Gf, MKL only 12Gf

Kenney answered 28/4, 2012 at 17:52 Comment(4)

Those are interesting benchmarks to which you link. I admit that I am initially skeptical of them. Of course, it is possible that someone has come up with a radical improvement on ATLAS, but I wonder if the plots to which you link don't rely on some unusual, special case. The reason I ask is because I have been using ATLAS for years, and moreover working with and corresponding with others who use ATLAS, and (unless I misunderstand the benchmarks) I've heard no whisper of benchmarks like this. But, hey, maybe I'll learn something here today. – Battlefield 28/4, 2012 at 17:59

@Battlefield Those are my thoughts exactly. I can readily believe Atlas may be slower than MKL but slower than Eigen by such huge margins?! – Kenney 28/4, 2012 at 18:2

IN provided graphs, Eigen loses with small matrix sizes in certain comparisons. I also think that you should be able to run benchmarks yoursefl and profile them - to see "WHY" there is difference performance. – Adjunct 28/4, 2012 at 18:41

It is also strange that a logarithmic scaling was used for the matrix dimensions. IMHO this makes no sense. It is clear that for small dimensions peak performance can not be reached. One is usually interested "how fast" peak performance can be reached if dimensions get increased. I used the ATLAS benchmark suite to compare DGEMM of some BLAS libraries (including Eigen) on a Intel Core-2-Duo here: apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/page14/index.html I get similar results on other architectures. In each case MKL and ATLAS achieved a higher performance than Eigen. – Kurtz 12/10, 2014 at 17:56

Eigen has lazy evaluation. From How does Eigen compare to BLAS/LAPACK?:

For operations involving complex expressions, Eigen is inherently faster than any BLAS implementation because it can handle and optimize a whole operation globally -- while BLAS forces the programmer to split complex operations into small steps that match the BLAS fixed-function API, which incurs inefficiency due to introduction of temporaries. See for instance the benchmark result of a Y = aX + bY operation which involves two calls to BLAS level1 routines while Eigen automatically generates a single vectorized loop.

The second chart in the benchmarks is Y = a*X + b*Y, which Eigen was specially designed to handle. It should be no wonder that a library wins at a benchmark it was created for. You'll notice that the more generic benchmarks, like matrix-matrix multiplication, don't show any advantage for Eigen.

Glarum answered 28/4, 2012 at 18:6 Comment(8)

I am more concerned with gemms - A*A' are noticably faster than mkl and only slighly slower than Goto. Almost 30% fast than mkl for N=1000. – Kenney 28/4, 2012 at 18:9

You should note that Goto hasn't been maintained for a while now. Some new people took over the code base and have migrated to OpenBLAS. – Glarum 28/4, 2012 at 18:10

that does not explain the huge margin over mkl – Kenney 28/4, 2012 at 18:14

Hmm, I've been digging around BLAS Level 3 docs to remember some of this stuff. GEMM performs a C += A*B rather than C = A*B, right? Ie, BLAS requires addition of the original matrix, even if it's all zeros? That seems like extra overhead. Also, GEMM takes the transpose ops for A and B, which means there's some sort of runtime conditional that must be applied in BLAS. Eigen wouldn't have either of those problems since it would be known at compile-time the exact action to perform. – Glarum 28/4, 2012 at 18:32

the Transpose is handled by basically exchange in the order of the three nested loops. Number of additions is only N^2 versus N^3 multiplies, very negligible. If interested, netlib.org/blas/dgemm.f – Kenney 28/4, 2012 at 18:47

@Glarum BLAS handles Operation on the form C = betaC + alphaAB. So if you want C = AB set beta=0 and alpha=1. And here some recent benchmarks. – Kurtz 5/3, 2016 at 9:31

This answer is a very important one. And in addition, it should point out the advantage of C++ template metaprogramming over plain C, which here in the library Eigen is especially powerful. Also I came to realize many ppl don't know about this concept or how to properly use it (I didn't either, for quite a while), which then may result in disadvantaged benchmarks for Eigen – Marathi 11/12, 2020 at 14:1

This is a misleading answer, which relies on Eigen own benchmarks, which were questioned by many people. With all due respect to the author, I think this answer should be removed, in order not to confuse people, who relies on information from stackoverflow. – Agha 14/6, 2023 at 12:23

Benchmarks are designed to be misinterpreted.

Let's look at the matrix * matrix product. The benchmark available on this page from the Eigen website tells you than Eigen (with its own BLAS) gives timings similar to the MKL for large matrices (n = 1000). I've compared Eigen 3.2.6 with MKL 11.3 on my computer (a laptop with a core i7) and the MKL is 3 times faster than Eigen for such matrices using one thread, and 10 times faster than Eigen using 4 threads. This looks like a completely different conclusion. There are two reasons for this. Eigen 3.2.6 (its internal BLAS) does not use AVX. Moreover, it does not seem to make a good usage of multithreading. This benchmark hides this as they use a CPU that does not have AVX support without multithreading.

Usually, those C++ libraries (Eigen, Armadillo, Blaze) bring two things:

Nice operator overloading: You can use +, * with vectors and matrices. In order to get nice performance, they have to use tricky techniques known as "Smart Template expression" in order to avoid temporary when they reduce the timing (such as y = alpha x1 + beta x2 with y, x1, x2 vectors) and introduce them when they are useful (such as A = B * C with A, B, C matrices). They can also reorder operations for less computations, for instance, if A, B, C are matrices A * B * C can be computed as (A * B) * C or A * (B * C) depending upon their sizes.
Internal BLAS: To compute the product of 2 matrices, they can either rely on their internal BLAS or one externally provided (MKL, OpenBLAS, ATLAS). On Intel chips with large matrices, the MKL il almost impossible to beat. For small matrices, one can beat the MKL as it was not designed for that kind of problems.

Usually, when those libraries provide benchmarks against the MKL, they usually use old hardware, and do not turn on multithreading so they can be on par with the MKL. They might also compare BLAS level 1 operations such as y = alpha x1 + beta x2 with 2 calls to a BLAS level 1 function which is a stupid thing to do anyway.

In a nutshell, those libraries are extremely convenient for their overloading of + and * which is extremely difficult to do without losing performance. They usually do a good job on this. But when they give you benchmark saying that they can be on par or beat the MKL with their own BLAS, be careful and do your own benchmark. You'll usually get different results ;-).

Blackandblue answered 31/12, 2015 at 13:52 Comment(6)

I have not large dense matrix multiplication with Eigen for a while but I think you have to enable OpenMP support to get multiple threads. You're right about AVX though, at least last time I checked. That's why it's not too difficult to beat Eigen's GEMM with AVX and FMA with custom code. But in any case largest dense matrix multiplication is mostly a pissing contest since large sparse matrix multiplication is much more useful in practice. – Vasilikivasilis 1/1, 2016 at 14:49

I have not managed to turn on OpenMP for matrix multiplication on Eigen. I haven't tried very hard though as I dislike all of those libraries (I developed my own with wrappers around the MKL). For what's useful in practice, it always depend upon your domain. On my side, I think I have never multiplied any matrix in one of my production code ;-) The only useful thing to benchmark is your problem on your platform. – Blackandblue 1/1, 2016 at 20:9

I second that. Do your own benchmarks. Here mathematik.uni-ulm.de/~lehn/test_ublas/session8/page01.html and mathematik.uni-ulm.de/~lehn/test_blaze/session1/page01.html I did some benchmarks comparing Intel MKL, Eigen, BLAZE and my own implementation. Its really hard to get even close to Intel MKL. All these benchmarks are pretty much self contained. So you can do them yourself. You should consider "trying to beat Intel MKL" as a hoppy or for self-educational purposes. Otherwise it would be frustrating ;-) – Kurtz 16/2, 2016 at 22:57

And the C++ world knows another trick to make MFLOP benchmarks misleading: Just log-scale the x-axis. This hides the results for large matrices. Never saw anybody outside the C++ box doing this. – Kurtz 16/2, 2016 at 23:8

I disagree with this answer, i.e. with "which is extremely difficult to do without losing performance". Eigen comes with lazy evaluation of stacked operations (by template metaprogramming) and can thus optimize the operations much better than consecutively declared operations - before the compiler actually gets to see the code. However this needs the programmer to a) know this concept and b) often think differently to make use of this possibililty. That's also the reason why I disagree with "Do your own benchmarks", because you might not use the full power of respective libraries yourself. – Marathi 11/12, 2020 at 13:55

@Marathi : I know all the benefits of lazy evaluations. It allows you to write operations like you would do in maths and let the Eigen library stack the operations in the nice order. When I've said that it is extremely difficult to do that, I never meant that Eigen did not solve correctly the problem. In fact, Eigen does it really well. But Eigen can be behind the MKL on some plateforms when performing those "stacked" operations. If your are comptetent, do your own benchmark and check if there is some difference on your platform with your matrix sizes. Anyway, you can let Eigen call the MKL. – Blackandblue 12/12, 2020 at 20:23

About the comparison ATLAS vs. Eigen

Have a look at this thread on the Eigen mailing list starting here:

http://listengine.tuxfamily.org/lists.tuxfamily.org/eigen/2012/07/msg00052.html

It shows for instance that ATLAS outperforms Eigen on the matrix-matrix product by 46%:

http://listengine.tuxfamily.org/lists.tuxfamily.org/eigen/2012/07/msg00062.html

More benchmarks results and details on how the benchmarks were done can be found here:

Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz:

http://www.mathematik.uni-ulm.de/~lehn/bench_FLENS/index.html
http://sourceforge.net/tracker/index.php?func=detail&aid=3540928&group_id=23725&atid=379483

Edit:

For my lecture "Software Basics for High Performance Computing" I created a little framework called ulmBLAS. It contains the ATLAS benchmark suite and students could implement their own matrix-matrix product based on the BLIS papers. You can have a look at the final benchmarks which also measure Eigen:

http://www.mathematik.uni-ulm.de/~lehn/sghpc/gemm/page14/index.html#toc5

You can use the ulmBLAS framework to make your own benchmarks.

Also have a look at

Kurtz answered 27/9, 2012 at 21:29 Comment(5)

Thank you Michael for the ulmBlas developement pages, these are very interesting -- it's quite fascinating that you get rather close to the MKL in terms of performance. – Kantos 10/1, 2015 at 22:8

This is a great answer, unfortunately most of the links don't work (or access is forbidden). I know this is expected all these years later, but an update/refresh on the material would be amazing – Tuscany 22/5, 2021 at 19:37

@LorahAttkins Thanks for telling I haven't noticed that. The links should work now. – Kurtz 24/5, 2021 at 4:48

I am very interested in your lecture "Software Basics for High Performance Computing" but failed to google it out, could you please provide the website of this lecture? – Flashcube 16/8, 2023 at 6:30

Since summer 2020 I changed the topic of the lecture to loosely spoken "Build your own computer and write a compiler for it". So the last time where the topic was "Write your own hardware optimized BLAS" was in summer 2019. The link to the website for the lab session is mathematik.uni-ulm.de/numerik/hpc/ss19_hpc0 P.S. I had to restrict the access to my websites because I was using some google fonts. And in the EU this might be expensive (you can use google translate) heise.de/news/… – Kurtz 24/8, 2023 at 22:51

Generic code can be fast because Compile Time Function Evaluation (CTFE) allows to choose optimal register blocking strategy (small temporary sub-matrixes stored in CPU registers).

Mir GLAS and Intel MKL are faster than Eigen and OpenBLAS. Mir GLAS is more generic compared to Eigen. See also the benchmark and reddit thread.

Potbelly answered 27/9, 2016 at 6:51 Comment(0)

It doesn't seem to consistently outperform other libraries, as can be seen on the graphs further down on that page you linked. So the different libraries are optimized for different use cases, and different libraries are faster for different problems.

This is not surprising, since you usually cannot optimize perfectly for all use cases. Optimizing for one specific operation usually limits the optimization options for other use cases.

Marbles answered 28/4, 2012 at 18:2 Comment(2)

But the core - gemm - are either the fastest or second fastest. For example A*A' are noticably faster than mkl. – Kenney 28/4, 2012 at 18:6

Eigen has the potential of outperforming other libraries on long expressions, because it can optimize the entire expression and generate code for it as a whole. Nothing that has C- or FORTAN-like API has this capability. For this functionality you need C++, Lisp, or something running on top of CLR (C#, F#, etc). – Attenweiler 12/6, 2012 at 17:48

I sent the same question to the ATLAS mailing list some time ago:

http://sourceforge.net/mailarchive/message.php?msg_id=28711667

Clint (the ATLAS developer) does not trust these benchmarks. He suggested some trustworthy benchmark procedure. As soon as I have some free time I will do this kind of benchmarking.

If the BLAS functionality of Eigen is actually faster then that of GotoBLAS/GotoBLAS, ATLAS, MKL then they should provide a standard BLAS interface anyway. This would allow linking of LAPACK against such an Eigen-BLAS. In this case, it would also be an interesting option for Matlab and friends.

Kurtz answered 10/6, 2012 at 14:2 Comment(5)

They actually do provide blas bindings, compile time has option to build them – Kenney 10/6, 2012 at 17:47

I know. And if I have some free time then I will perform the benchmarks as suggested by Clint. However, I would prefer that the Eigen team also does this. The problem with benchmarking any templated C++ libraries are these endless discussions about what compiler version and options have to be used. Compared to that building a library like ATLAS is much more robust in this respect. So I would like to see what benchmarks are actually possible if the Eigen people use their optimal compiler configuration and run such a "BLAS/LAPACK benchmark". – Kurtz 10/6, 2012 at 18:56

Using Eigen with a BLAS interface is downright silly. Eigen shines because it can optimize entire expressions, often generating a vectorizable loop that encompasses all operations. With BLAS-like interfaces, if there's no API available for a particular expression, you have to introduce temporary variables for subexpressions, and that kills performance. Eigen's performance depends on the fact that it leverages the code generation facilities of the C++ compiler. No C-based API can do that, unless the C implementation uses a smart just-in-time compiling runtime (none popular do). – Attenweiler 12/6, 2012 at 17:46

You don't get the point: 1) Functions defined in the BLAS standard are those that are most crucial for the performance of numerical software. For example, a matrix-matrix product of the form (gemm) C = betaC + alphaA*B. 2) The Eigen people claim that they can perform this operation faster than ATLAS or GotoBLAS. And there is doubt that this is true. So the point of interest for the high performance computing people is: How can one check if Eigen is capable to actually do gemm faster then MKL, ATLAS, GotoBLAS. – Kurtz 12/6, 2012 at 18:12

@KubaOber. Beside this, another question is: Are the benchmarks published by the Eigen people trustworthy? They use some questionable benchmark suite. And it is strange that they claim to outperform established libraries but do not inform the maintainer of these libraries about the results (the ATLAS guy never even heard of them). I mean, at least you have to give the others the chance to response and comment the results. – Kurtz 12/6, 2012 at 19:15

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

About the comparison ATLAS vs. Eigen

Recommended topics

Hot tags