Why is my Strassen's Matrix Multiplication slow?
I wrote two matrix multiplication programs in C++: Regular MM (source), and Strassen's MM (source), both of which operate on square matrices of size 2^k x 2^k (in other words, square matrices whose size is a power of two).

Results are just terrible. For a 1024 x 1024 matrix, Regular MM takes 46.381 sec, while Strassen's MM takes 1484.303 sec (25 minutes!!!!).

I attempted to keep the code as simple as possible. Other Strassen's MM examples found on the web are not that much different from my code. One issue with Strassen's code is obvious - I don't have a cutoff point that switches to regular MM.

What other issues does my Strassen's MM code have???

Thanks !

Direct links to sources
http://pastebin.com/HqHtFpq9
http://pastebin.com/USRQ5tuy

Edit1. First, a lot of great advice. Thank you for taking your time and sharing knowledge.

I implemented the changes (kept all of my code) and added a cut-off point. MM of a 2048x2048 matrix with cutoff 512 already gives good results. Regular MM: 191.49s, Strassen's MM: 112.179s - a significant improvement. Results were obtained on a prehistoric Lenovo X61 TabletPC with an Intel Centrino processor, using Visual Studio 2012. I will do more checks (to make sure I got the correct results), and will publish the results.

Penang answered 26/11, 2012 at 6:47 Comment(13)
@LuchianGrigore: Oh, that's subtle. And likely also a big part of the problem. Probably a bigger issue than the one I identified actually.Decongestant
@LuchianGrigore Woah, how did you jump to that conclusion? Nowhere in the question does the OP state that it's slower only for powers-of-two. If anything, I bet that 1024 simply isn't big enough to overcome the overhead.Directory
@Directory I assumed one of the algorithms was cache oblivious, since the dimensions were fixed to 2^k. Might be wrong.Package
@Mysticial: All the possible sizes in the code are powers of 2.Decongestant
@LuchianGrigore: One algorithm is dead-simple (and not cache oblivious). The other is also not cache-oblivious, though it's supposed to be a LOT faster.Decongestant
@Decongestant i'd test with sizes that are not powers of 2 - if the problem goes away, that's the reason.Package
@LuchianGrigore: Unfortunately, I believe that Strassen's algorithm is a recursive divide-and-conquer algorithm. I suspect it can be made to work for non-powers of 2. But I also suspect that the OPs implementation does not.Decongestant
@Decongestant I intentionally designed my code for powers of two; otherwise part of the code needs to be rewritten, since matrices after division will not be even and will need padding (a row of zeros).Penang
@LuchianGrigore you are correct !!Penang
@Penang well then, the reason is the one explained herePackage
@Directory that was it (apparently)Package
I would think that power-of-two super-alignment will hurt the normal algorithm more than the Strassen Algorithm since the latter is cache oblivious. Perhaps I'm missing something.Directory
@Penang I just realized that you were already aware of the recursion-to-1 problem. So I've expanded my answer a bit. Basically, the performance hit that causes will dwarf anything else. So you can't draw any conclusions unless you have a somewhat optimized implementation.Directory

One issue with Strassen's code is obvious - I don't have a cutoff point that switches to regular MM.

It's fair to say that recursing all the way down to 1 element is the bulk of (if not the entire) problem. Trying to guess at other performance bottlenecks without addressing this is almost moot, given the massive performance hit that it brings. (In other words, you're comparing apples to oranges.)

As discussed in the comments, cache alignment could have an effect, but not to this scale. Furthermore, cache alignment would likely hurt the regular algorithm more than the Strassen algorithm since the latter is cache-oblivious.

void strassen(int **a, int **b, int **c, int tam) {

    // trivial case: when the matrix is 1 X 1:
    if (tam == 1) {
            c[0][0] = a[0][0] * b[0][0];
            return;
    }

That's far too small. While the Strassen algorithm has a smaller asymptotic complexity, it has a much bigger constant factor. For one thing, you pay function call overhead all the way down to 1 element.

This is analogous to using merge or quick sort and recursing all the way down to one element. To be efficient you need to stop the recursion when the size gets small and fall back to the classic algorithm.

In quick/merge sort, you'd fall back to a low-overhead O(n^2) insertion or selection sort. Here you would fall back to the normal O(n^3) matrix multiply.


The threshold at which you fall back to the classic algorithm should be tunable; it will likely vary depending on the hardware and on the compiler's ability to optimize the code.

For something like Strassen multiplication, where the advantage is only O(n^2.8074) over the classic O(n^3), don't be surprised if this threshold turns out to be very high. (Thousands of elements?)


In some applications there can be many algorithms, each with a lower asymptotic complexity but a larger constant factor. The result is that different algorithms become optimal at different sizes.

Large integer multiplication is a notorious example of this: the schoolbook O(n^2) method wins at small sizes, Karatsuba's O(n^1.585) takes over somewhere in the hundreds of digits, Toom-Cook 3's O(n^1.465) in the thousands, and FFT-based methods beyond that (per the comments below, one GMP installation crosses over at roughly 600, 2000, and 150000 digits respectively).

*Note these example thresholds are approximate and can vary drastically - often by more than a factor of 10.

Directory answered 26/11, 2012 at 7:41 Comment(12)
This is one of the reasons I love StackOverflow. With one question I was shown real world examples where subtle effects that can create performance problems were magnified and demonstrated in an obvious way. And then, of course, this answer is very likely what's causing the algorithm that should be faster to be slower.Decongestant
I tested, and for the naive algorithm the problem I fingered is significant. But it should affect both algorithms significantly and so probably isn't the reason for the poor performance of the supposedly faster algorithm.Decongestant
@Decongestant Yeah, I'd expect the power-of-two alignment penalties to be no more than a factor of 3. Since all four major examples of it on SO have only about 3x performance hit. Here the OP has something like 30x performance difference.Directory
Your cutoffs are really far too high.Decimal
@Mystical - Oh, no, my issue was simpler. The thing I was annoyed with was the fact that each row was being allocated separately which kills locality of reference and introduces indirections to every element access.Decongestant
@Decongestant Wait, why would cache coherency have anything to do with it? This isn't threaded.Directory
@Mystical: Oops, wrong term. 'locality of reference'. sigh It's late and my thinking is fuzzy.Decongestant
@Decongestant Ah... In this case, sacrificing locality to break the alignment actually improves performance.Directory
@Mysticial: Yeah, I was thinking that. Though the indirection is also an issue. So it's not clear exactly how all the competing effects will turn out. My empirical test shows that allocating the array contiguously boosts performance by 50% for the naive algorithm, which is the least cache oblivious.Decongestant
@Mystical: Just nitpicking, really. My rather old gmp installation has the cutoff for Karatsuba multiplication at 31 limbs (around 600 digits), Toom3 around 2000 digits, and FFT around 150000 digits. And I think they improved the FFT multiplication considerably in a more recent release. The threshold for FFT multiplication is not in the billions. And a problem with the floating-point FFT is that it doesn't actually work for nontrivial operand sizes. (It seems to but it doesn't.)Decimal
@Decimal Ah. That's why I gave a factor of 10. Those numbers are roughly the values used in y-cruncher. GMP doesn't use Floating-point FFTs because they are afraid of round-off. So when you factor in floating-point FFTs, it pushes the threshold for SSA (what GMP uses) up to the billions. Beyond 100,000 digits or so, y-cruncher uses a whole set of unpublished algorithms. So I don't actually know the exact thresholds of SSA and NTT other than "well into the billions" - because I've never tested it.Directory
This is one of the best answers that I've ever seen someone give to a question! For me was a light in the middle of the darkness! Thanks a lot!Quacksalver

So, there may be more problems than this, but your first problem is that you're using arrays of pointers to arrays. And since you're using array sizes that are powers of 2, this is an especially big performance hit compared to allocating the elements contiguously and using index arithmetic (row * width + column) to fold the long array of numbers into rows.

Anyway, that's my first guess as to a problem. As I said, there may be more, and I'll add to this answer as I discover them.

Edit: This likely only contributes a small amount to the problem. The problem is likely the one Luchian Grigore refers to involving cache line contention issues with powers of two.

I verified that my concern is valid for the naive algorithm. The time for the naive algorithm goes down by almost 50% if the array is contiguous instead. Here is the code for this (using a SquareMatrix class that is C++11 dependent) on pastebin.

Decongestant answered 26/11, 2012 at 7:12 Comment(4)
Thank you for your help! I will take a look at your code in the morning.Penang
@newprint: I made a couple of small errors that have no effect on your code but make the SquareMatrix class not safe for general use. I will fix them.Decongestant
here simple version - void madd(int N, int Xpitch, const double X[], int Ypitch, const double Y[], int Spitch, double S[]) { for (int i = 0; i < N; i++) for (int j = 0; j < N; j++) S[i*Spitch + j] = X[i*Xpitch + j] + Y[i*Ypitch + j]; }Penang
@newprint: I don't like that version as you have to remember to use the multiply with every matrix access. But it is very nicely C-ish and uses no C++ features at all. :-) My version (with its inline functions) allows the compiler to make some interesting assumptions and do some really nice optimizations while allowing the actual multiply algorithm to still appear to cleanly use multi-dimensional array access.Decongestant
