Why is Strassen matrix multiplication so much slower than standard matrix multiplication?
I have written programs in C++, Python and Java for matrix multiplication and tested their speed for multiplying two 2000 x 2000 matrices (see post). The standard ikj-implementation, which runs in O(n^3), took:

  • C++: 15 seconds (Source)
  • Python: 6 minutes 13 seconds (Source)

Now I have implemented the Strassen algorithm for matrix multiplication, which runs in O(n^2.807), in Python and C++, following the description on Wikipedia. These are the times I got:

  • C++: 45 minutes (Source)
  • Python: Killed after 10 hours (Source)

Why is Strassen matrix multiplication so much slower than standard matrix multiplication?


Ideas:
  • Some cache effects
  • Implementation:
    • error (the resulting 2000 x 2000 matrix is correct)
    • null-multiplication (shouldn't be that important for 2000 x 2000 -> 2048 x 2048)

This is especially astonishing, as it seems to contradict the experiences of others.


edit: The reasons why Strassen matrix multiplication was slower in my case were:

  • I made it fully recursive, all the way down to 1 x 1 matrices (see tam)
  • I had two functions, strassen and strassenRecursive. The first one resized the matrix to a power of two, if required, and then called the second one. But strassenRecursive didn't recursively call itself; it called strassen, so the matrices were re-padded at every level of the recursion.
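To illustrate the second bug, here is a minimal Python sketch of the fixed structure (the helper names next_pow2 and strassen_recursive are mine, and the worker body is only a naive placeholder, not the real Strassen recursion): the wrapper pads to a power of two exactly once, and the worker only ever calls itself.

```python
def next_pow2(n):
    m = 1
    while m < n:
        m *= 2
    return m

def strassen_recursive(A, B):
    # placeholder for the real Strassen recursion - the important point is
    # that it calls itself, never strassen(), which would re-pad the
    # operands at every level of the recursion
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def strassen(A, B):
    # entry point: pad to a power of two exactly once, then hand off
    n = len(A)
    m = next_pow2(n)
    Ap = [row + [0] * (m - n) for row in A] + [[0] * m for _ in range(m - n)]
    Bp = [row + [0] * (m - n) for row in B] + [[0] * m for _ in range(m - n)]
    Cp = strassen_recursive(Ap, Bp)
    return [row[:n] for row in Cp[:n]]
```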
Neighboring answered 15/7, 2012 at 21:25 Comment(4)
Haven't checked it, but there are a lot of new vectors being allocated. I imagine the memory allocation times are what is killing it.Tractable
Voo's answer basically covers the memory allocation issue as well, since stopping the recursion sooner will reduce the number of allocations. BTW: On my computer I found a good value to be around 250 for the cutoff.Tractable
By the way, your posted source can't be experimented with by anybody, because you don't post the data file. This means that nobody can do anything except speculate.Tunable
@DeadMG: actually the data file is there, just up a couple of levels in the Testing directory.Tractable

The basic problem is that you're recursing down to a leaf size of 1 with your Strassen implementation. Strassen's algorithm has a better big-O complexity, but constant factors do matter in practice, which means that for smaller problem sizes you're better off with a standard n^3 matrix multiplication.

So to greatly improve your program, instead of doing:

if (tam == 1) {
    C[0][0] = A[0][0] * B[0][0];
    return;
}

use if (tam <= LEAF_SIZE) and run the iterative solution there. LEAF_SIZE should be a constant that you have to determine experimentally for your given architecture. Depending on the architecture it may be larger or smaller - there are architectures where the constant factors for Strassen are so large that it's basically always worse than a simpler n^3 implementation for sensible matrix sizes. It all depends.
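As an illustration (not the poster's code), a Strassen sketch in Python with such a cutoff might look like this; it assumes both matrices are square with power-of-two size, and LEAF_SIZE is the constant to tune:

```python
LEAF_SIZE = 64  # tune experimentally for your architecture

def naive_mult(A, B):
    # standard ikj triple loop, used below the cutoff
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            aik = A[i][k]
            for j in range(n):
                C[i][j] += aik * B[k][j]
    return C

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def strassen(A, B, leaf=LEAF_SIZE):
    n = len(A)
    if n <= leaf:
        return naive_mult(A, B)  # stop recursing, go iterative
    h = n // 2
    # split into quadrants
    A11 = [row[:h] for row in A[:h]]; A12 = [row[h:] for row in A[:h]]
    A21 = [row[:h] for row in A[h:]]; A22 = [row[h:] for row in A[h:]]
    B11 = [row[:h] for row in B[:h]]; B12 = [row[h:] for row in B[:h]]
    B21 = [row[:h] for row in B[h:]]; B22 = [row[h:] for row in B[h:]]
    # the seven Strassen products
    M1 = strassen(add(A11, A22), add(B11, B22), leaf)
    M2 = strassen(add(A21, A22), B11, leaf)
    M3 = strassen(A11, sub(B12, B22), leaf)
    M4 = strassen(A22, sub(B21, B11), leaf)
    M5 = strassen(add(A11, A12), B22, leaf)
    M6 = strassen(sub(A21, A11), add(B11, B12), leaf)
    M7 = strassen(sub(A12, A22), add(B21, B22), leaf)
    # recombine into the result quadrants
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(sub(add(M1, M3), M2), M6)
    return ([C11[i] + C12[i] for i in range(h)] +
            [C21[i] + C22[i] for i in range(h)])
```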

Everett answered 15/7, 2012 at 21:30 Comment(8)
@Mysticial Ah it's better so, your time is better spent answering questions that mere humans will find hard to answer ;)Everett
You were right. I've added LEAF_SIZE in this script: github.com/MartinThoma/matrix-multiplication/blob/master/… . For leaf size 10 the time dropped to 66.50 seconds, for 20 to 29.96 seconds, and for 50 to 18.80 seconds. How can I test for good values of LEAF_SIZE in a more structured, automatic way than changing the value in the code, recompiling, testing, and trying other values? Do you know an easy way to graph it? (Should I ask another question, as this seems to go in a different direction than my previous one?)Neighboring
@moose Well, make the program take the leaf size as an input parameter. Personally I then do the following: for every leaf size, run the program ten times (more is better, but 10 is somewhat accurate) and store all values in a text file (64.txt, 128.txt, etc.) - that's a shell script job, obviously. Afterwards use a simple script (I like Python) that takes the runtimes, throws away the 2 fastest/slowest, computes the average of the rest, and outputs that data as CSV. A CSV has the great advantage that Excel/OpenOffice and co. can all read it and generate nice graphs with two clicks.Everett
Oh, and just a note: any leaf size that's not a power of 2 will behave the same as the next lower power of 2 (if I'm not completely confused right now), so there aren't that many values to test anyhow.Everett
Thanks for your help. I've just plotted the results: cloud.github.com/downloads/MartinThoma/matrix-multiplication/…Neighboring
@moose Well done. It could be interesting to increase the leaf size further to 2^8 and 2^9 to see when the effect starts to reverse itself. At that point you have your global optimum for your architecture. Although it may be easier to see that for larger matrices, as outside interference while executing the program can become dominant for execution times around 10 seconds.Everett
On another sidenote: Your java program probably executes quite some part of it interpreted (or with OSR) which can increase runtime. If you're mostly interested in how fast the java program will run after JITting you'll have to change the benchmark a bit (not use time command, but instead print runtime of matrix after some warmup). Although with the right compiler (options) the speed difference may come down to vectorization/auto parallelization in C which the JVM sadly won't do, so you may not see much differences.Everett
Testing on an ARM Cortex A9 based platform showed a leaf size of 32 to be optimal for a 512x512 matrix. The difference is 5 sec for LS=32 compared to 193 sec for LS=1!!!Cineaste
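The leaf-size sweep described in the comments above (run each size several times, drop the outliers, average the rest, emit CSV) could be scripted roughly like this; the ./strassen binary name and its leaf-size argument are assumptions, not the poster's actual interface:

```python
import subprocess
import time

def time_run(cmd):
    # wall-clock one invocation of the benchmark binary
    t0 = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - t0

def trimmed_mean(times, trim=2):
    # drop the `trim` fastest and slowest runs and average the rest
    kept = sorted(times)[trim:len(times) - trim]
    return sum(kept) / len(kept)

def sweep(leaf_sizes, runs=10):
    # print CSV lines "leaf_size,seconds" - ready for a spreadsheet plot
    for ls in leaf_sizes:
        times = [time_run(["./strassen", str(ls)]) for _ in range(runs)]
        print(f"{ls},{trimmed_mean(times):.3f}")
```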

Well, "arithmetic operations" are not the only things that count. It's not like everything else is free.

My naive guess would be that all this memory-allocating and copying beats the gain from having fewer arithmetic operations...

Memory access, in particular, can be quite expensive when it falls out of the cache. In comparison, arithmetic operations could be considered almost free :-)

Spiroid answered 15/7, 2012 at 21:37 Comment(2)
..and, in case of C++, an optimization could probably be using placement new on a sufficiently large block of memory allocated beforehand.Brady
Agree, I think that all of the memory mechanics going on here are a big source of slowdown.Tunable

I remember that I did the same thing back in college. My implementation was in Java. I also wrote a script to test the code; I had more than 10000 test cases with random matrices of sizes from 2 x 2 up to 8192 x 8192. I did not let the recursion go down to the scalar level; I used various powers of 2 as stopping points. I found a range where Strassen's is more efficient, and a range where it is worse than the naive algorithm.

I did not investigate cache, memory, or JVM (garbage collection).

When I presented in front of the class, I attributed the findings to the fact that Strassen's algorithm's asymptotic complexity is measured in terms of the number of multiplications. It was devised at a time when computers performed additions faster than multiplications.

Nowadays CPUs multiply as fast as they add (in terms of cycles). If you examine both algorithms and count all arithmetic operations, Strassen's performs fewer operations than the naive algorithm only once the size exceeds roughly 2^10 (if I remember correctly).
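That crossover can be checked with a quick operation count; this sketch (my own, not from the answer) compares the naive triple loop's n^3 multiplications plus n^2(n-1) additions against the Strassen recurrence of 7 half-size sub-products plus 18 half-size matrix additions/subtractions per level:

```python
def naive_ops(n):
    # n^3 multiplications plus n^2 * (n - 1) additions
    return n**3 + n**2 * (n - 1)

def strassen_ops(n):
    # full recursion down to 1x1: 7 sub-multiplications plus
    # 18 additions/subtractions of (n/2) x (n/2) blocks per level
    if n == 1:
        return 1
    half = n // 2
    return 7 * strassen_ops(half) + 18 * half * half

for k in range(8, 12):
    n = 2**k
    cheaper = "Strassen" if strassen_ops(n) < naive_ops(n) else "naive"
    print(f"n = {n}: {cheaper} does fewer operations")
```

With these counts the fully recursive Strassen only wins on raw operations somewhere between n = 512 and n = 1024, consistent with the "roughly 2^10" recollection above.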

People answered 27/5, 2021 at 14:43 Comment(0)

Although the Strassen algorithm has a smaller big-O complexity, in order to take advantage of this you would need to multiply matrices that are too big to handle on most standard machines, and even on supercomputers.

Think of it this way: one cost grows like n^3, while the other grows like c * n^2.807 plus lower-order terms, where the constant c is large.

Conspectus answered 15/7, 2012 at 21:34 Comment(2)
Not really. You generally get a cutoff value in the hundreds on modern machines for Strassen. And really, 600x600 matrices are small in this day and age. Heck, problems with 50k x 50k matrices aren't noteworthy today (9 GB of memory? there are desktops with 16 GB+ around).Everett
You may be referring to the Coppersmith-Winograd algorithm: en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_algorithmCineaste
