Matrix multiplication: Small difference in matrix size, large difference in timings

I have a matrix multiply code that looks like this:

for(i = 0; i < dimension; i++)
    for(j = 0; j < dimension; j++)
        for(k = 0; k < dimension; k++)
            C[dimension*i+j] += A[dimension*i+k] * B[dimension*k+j];

Here, the size of the matrix is represented by dimension. Now, if the size of the matrices is 2000, it takes 147 seconds to run this piece of code, whereas if the size of the matrices is 2048, it takes 447 seconds. So while the number of multiplications grows only by a factor of (2048*2048*2048)/(2000*2000*2000) = 1.073, the timing grows by a factor of 447/147 = 3. Can someone explain why this happens? I expected it to scale linearly, which does not happen. I am not trying to make the fastest matrix multiply code, simply trying to understand why this happens.

Specs: AMD Opteron dual core node (2.2GHz), 2G RAM, gcc v 4.5.0

Program compiled as gcc -O3 simple.c

I have run this on Intel's icc compiler as well, and seen similar results.

EDIT:

As suggested in the comments/answers, I ran the code with dimension=2060 and it takes 145 seconds.

Here's the complete program:

#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

/* change dimension size as needed */
const int dimension = 2048;
struct timeval tv; 

double timestamp()
{
        double t;
        gettimeofday(&tv, NULL);
        t = tv.tv_sec + (tv.tv_usec/1000000.0);
        return t;
}

int main(int argc, char *argv[])
{
        int i, j, k;
        double *A, *B, *C, start, end;

        A = (double*)malloc(dimension*dimension*sizeof(double));
        B = (double*)malloc(dimension*dimension*sizeof(double));
        C = (double*)malloc(dimension*dimension*sizeof(double));

        srand(292);

        for(i = 0; i < dimension; i++)
                for(j = 0; j < dimension; j++)
                {   
                        A[dimension*i+j] = (rand()/(RAND_MAX + 1.0));
                        B[dimension*i+j] = (rand()/(RAND_MAX + 1.0));
                        C[dimension*i+j] = 0.0;
                }   

        start = timestamp();
        for(i = 0; i < dimension; i++)
                for(j = 0; j < dimension; j++)
                        for(k = 0; k < dimension; k++)
                                C[dimension*i+j] += A[dimension*i+k] *
                                        B[dimension*k+j];

        end = timestamp();
        printf("\nsecs:%f\n", end-start);

        free(A);
        free(B);
        free(C);

        return 0;
}
Lipscomb answered 26/10, 2011 at 16:28 Comment(9)
Probably key to your understanding is that matrix multiplication doesn't scale linearly; your code is on the order of O(n^3).Processional
Maybe caching related, considering the power-of-two-ness of 2048?Ossie
Run some shell scripts that step through from 2000 to 2048 and graph the times. It might provide some insight as to what's going on. Just note it will take about 3-6 hours to run, depending on where the break-even point is, so make sure you have other stuff to do in the meantime :)Fertility
@Processional I don't know how this is related in any way to his problem. He is totally aware of the complexity of his algorithm. Have you even read the question?Ossie
Try a test with e.g. dimension = 2060 - this will tell you if the problem is related to e.g. cache size or whether it's a super-alignment problem such as cache thrashing or TLB thrashing.Criminal
Ok, I ran it with dimension=2060, and it took 145 seconds (so that's 447 sec for 2048, 145 sec for 2060). Could you elaborate on what happens (and perhaps submit it as an answer)?Lipscomb
@KVM: Mysticial has now expanded his answer to cover super-alignment and cache thrashing etcCriminal
I edited my answer addressing this for L2. Paul R is right. Though I'm tempted to think it's thrashing the L2 <-> memory instead of the TLB.Putamen
Note that transposing one of the matrices (can be done in place) will lead to better results for these typical sizes (the break even point may vary). Indeed, transposing is O(n^2) (vs. O(n^3) multiplication) and the memory is accessed sequentially for both matrices, leading to better cache use.Bathrobe
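
A rough sketch of the transpose idea from the last comment, reusing the arrays and the dimension variable from the question (square matrices assumed; this only illustrates the access pattern, it is not meant to be the fastest version):

/* O(n^2): transpose B in place (square matrix assumed) */
for(i = 0; i < dimension; i++)
    for(j = i + 1; j < dimension; j++)
    {
        double tmp = B[dimension*i+j];
        B[dimension*i+j] = B[dimension*j+i];
        B[dimension*j+i] = tmp;
    }

/* O(n^3): both A and the transposed B are now walked sequentially in k */
for(i = 0; i < dimension; i++)
    for(j = 0; j < dimension; j++)
        for(k = 0; k < dimension; k++)
            C[dimension*i+j] += A[dimension*i+k] * B[dimension*j+k];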

Here's my wild guess: cache

It could be that you can fit 2 rows of 2000 doubles into the cache: 2 x 2000 x 8 = 32,000 bytes, which is slightly less than the 32 KB L1 cache (while leaving room for other necessary things).

But when you bump it up to 2048, two rows are 2 x 2048 x 8 = 32,768 bytes, so they use the entire cache (and you spill some because you need room for other things).

Assuming the cache policy is LRU, spilling the cache just a tiny bit will cause the entire row to be repeatedly flushed and reloaded into the L1 cache.

The other possibility is cache associativity due to the power-of-two. Though I think that processor is 2-way L1 associative so I don't think it matters in this case. (but I'll throw the idea out there anyway)

Possible Explanation 2: Conflict cache misses due to super-alignment on the L2 cache.

Your B array is iterated by column, so the access is strided. Your total data size is 2k x 2k doubles, which is about 32 MB per matrix. That's much larger than your L2 cache.

When the data is not aligned perfectly, you will have decent spatial locality on B. Although you are hopping rows and only using one element per cacheline, the cacheline stays in the L2 cache to be reused by the next iteration of the middle loop.

However, when the data is aligned perfectly (2048), these hops will all land on the same "cache way" and will far exceed your L2 cache associativity. Therefore, the accessed cache lines of B will not stay in cache for the next iteration. Instead, they will need to be pulled in all the way from RAM.
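
One way to test this explanation without changing the algorithm is to pad the leading dimension of the arrays so that consecutive rows are no longer exactly a power-of-two apart. A minimal sketch against the program in the question; the name stride and the pad of 8 doubles (one cache line) are arbitrary illustrative choices:

const int dimension = 2048;
const int stride = dimension + 8;   /* padded leading dimension */

A = (double*)malloc((size_t)stride*dimension*sizeof(double));
B = (double*)malloc((size_t)stride*dimension*sizeof(double));
C = (double*)malloc((size_t)stride*dimension*sizeof(double));

/* initialize as before, then time the same triple loop using the padded stride */
for(i = 0; i < dimension; i++)
    for(j = 0; j < dimension; j++)
        for(k = 0; k < dimension; k++)
            C[stride*i+j] += A[stride*i+k] * B[stride*k+j];

If the 2048 case now runs in roughly the same time as 2000 or 2060 while doing the same number of multiplications, super-alignment (conflict misses) was the culprit.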

Putamen answered 26/10, 2011 at 16:32 Comment(8)
I agree in suspecting cache. You can do a set of experiments and plot runtime vs. dimension. If it's cache, you would see linearity in the neighborhood of similar sizes, with some sharp breaking points where you get a big step and change in linear slope.Suez
Not just cache size - when the matrices are super-aligned as in the 2048 case then you can start to see problems with cache thrashing, TLB thrashing, etc. Try it with e.g. 2060 and see what happens...Criminal
I ran it with dimension=2060 and it took 145 seconds. Looking at explanation 2, this too should have poor spatial locality. For dimension >= 2048, cache lines of B will need to be fetched from RAM, right?Lipscomb
There's no spatial locality of B in the inner loop. But there is in the middle loop. Super-alignment, will prevent the needed parts of B from staying in cache across iterations of the middle loop.Putamen
@AhmedMasud And I don't think using times explains his problem, either.Ossie
nice answer. posted another way to test L2.Bellay
@Putamen I am afraid I still did not get the picture. I tried with dimension=2047, 2049 and it took 145 seconds. Only for 2048 does it take 445 seconds. I don't understand what it means when you say "hops land on the same cache way". I still think anything >= 2048 should show similar behaviour.Lipscomb
Because of the way caches work, an N-way cache can only hold at most N cachelines whose addresses are equal modulo a large power-of-two. (I don't know the exact number unless you tell me what processor model # you have.) When the dimension is 2048, the cachelines accessed by B all have addresses that are equal modulo that power-of-two, so they will conflict. (Google: "Conflict Cache Miss")Putamen

You are definitely getting what I call a cache resonance. This is similar to aliasing, but not exactly the same. Let me explain.

Caches are hardware data structures that extract one part of the address and use it as an index in a table, not unlike an array in software. (In fact, we call them arrays in hardware.) The cache array contains cache lines of data, and tags - sometimes one such entry per index in the array (direct mapped), sometimes several such (N-way set associativity). A second part of the address is extracted and compared to the tag stored in the array. Together, the index and tag uniquely identify a cache line memory address. Finally, the rest of the address bits identifies which bytes in the cache line are addressed, along with the size of the access.

Usually the index and tag are simple bitfields. So a memory address looks like

  ...Tag... | ...Index... | Offset_within_Cache_Line

(Sometimes the index and tag are hashes, e.g. a few XORs of other bits into the mid-range bits that are the index. Much more rarely, sometimes the index, and more rarely the tag, are things like taking cache line address modulo a prime number. These more complicated index calculations are attempts to combat the problem of resonance, which I explain here. All suffer some form of resonance, but the simplest bitfield extraction schemes suffer resonance on common access patterns, as you have found.)

So, typical values... there are many different models of "Opteron Dual Core", and I do not see anything here that specifies which one you have. Choosing one at random: the most recent manual I see on AMD's website is the BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h Models 00h-0Fh, March 12, 2012.

(Family 15h = Bulldozer family, the most recent high-end processor - the BKDG mentions dual core, although I don't know of a product number that exactly matches what you describe. But, anyway, the same idea of resonance applies to all processors; it is just that parameters like cache size and associativity may vary a bit.)

From p.33:

The AMD Family 15h processor contains a 16-Kbyte, 4-way predicted L1 data cache with two 128-bit ports. This is a write-through cache that supports up to two 128 Byte loads per cycle. It is divided into 16 banks, each 16 bytes wide. [...] Only one load can be performed from a given bank of the L1 cache in a single cycle.

To sum up:

  • 64 byte cache line => 6 offset bits within the cache line, i.e. address bits 0-5 are the cache line offset.

  • 16KB/4-way => the resonance is 4KB.

  • 16KB / 64B cache lines => 2^14/2^6 = 2^8 = 256 cache lines in the cache.
    (Bugfix: I originally miscalculated this as 128; I hope I have fixed all the dependencies.)

  • 4 way associative => 256/4 = 64 indexes in the cache array. I (Intel) call these "sets".

    i.e. you can consider the cache to be an array of 64 entries or sets, each entry containing 4 cache lines and their tags. (It's more complicated than this, but that's okay).

(By the way, the terms "set" and "way" have varying definitions.)

  • there are 6 index bits, bits 6-11 in the simplest scheme.

    This means that any cache lines that have exactly the same values in the index bits, bits 6-11, will map to the same set of the cache.

Now look at your program.

C[dimension*i+j] += A[dimension*i+k] * B[dimension*k+j];

Loop k is the innermost loop. The base type is double, 8 bytes. If dimension=2048, i.e. 2K, then successive elements of B[dimension*k+j] accessed by the loop will be 2048 * 8 = 16K bytes apart. They will all map to the same set of the L1 cache - they will all have the same index in the cache. Which means that, instead of there being 256 cache lines available for use, there will only be 4 - the "4-way associativity" of the cache.

I.e. you will probably get a cache miss every 4 iterations around this loop. Not good.

(Actually, things are a little more complicated, but the above is a good first understanding. The addresses of the entries of B mentioned above are virtual addresses, so the physical addresses might be slightly different. Moreover, Bulldozer has a way-predicted cache, probably using virtual address bits so that it doesn't have to wait for a virtual-to-physical address translation. But, in any case: your code has a "resonance" of 16K. The L1 data cache has a resonance of 16K. Not good.)

If you change the dimension just a little bit, e.g. to 2048+1, then the addresses of array B will be spread across all of the sets of the cache. And you will get significantly fewer cache misses.
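
A small standalone sketch to make that visible, using the geometry worked out above (64-byte lines, 64 sets, index = address bits 6-11); it only looks at virtual addresses, so treat it as an approximation:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        const int dimension = 2048;   /* try 2049 or 2060 for comparison */
        double *B = malloc((size_t)dimension*dimension*sizeof(double));
        int seen[64] = {0};
        int j = 0, k, distinct = 0;

        /* which of the 64 L1 sets do the first 256 column accesses of B touch? */
        for(k = 0; k < 256; k++)
                seen[((uintptr_t)&B[(size_t)dimension*k + j] >> 6) & 63] = 1;
        for(k = 0; k < 64; k++)
                distinct += seen[k];

        printf("dimension %d: B column accesses touch %d of 64 L1 sets\n",
               dimension, distinct);

        free(B);
        return 0;
}

With dimension = 2048 everything lands in a single set; with 2049 or 2060 the accesses spread over many sets, which is exactly the difference between thrashing 4 ways and using the whole cache.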

It is a fairly common optimization to pad your arrays, e.g. to change 2048 to 2049, to avoid this sort of resonance. But cache blocking is an even more important optimization: http://suif.stanford.edu/papers/lam-asplos91.pdf
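
As a rough illustration of cache blocking (not taken from the paper above; BLOCK is just a tuning knob, to be chosen so that three BLOCK x BLOCK tiles fit comfortably in cache), the timed loop from the question could be restructured like this:

int ii, jj, kk;          /* in addition to the existing i, j, k */
const int BLOCK = 64;    /* arbitrary starting point; tune for your caches */

for(ii = 0; ii < dimension; ii += BLOCK)
    for(kk = 0; kk < dimension; kk += BLOCK)
        for(jj = 0; jj < dimension; jj += BLOCK)
            for(i = ii; i < ii + BLOCK && i < dimension; i++)
                for(k = kk; k < kk + BLOCK && k < dimension; k++)
                    for(j = jj; j < jj + BLOCK && j < dimension; j++)
                        C[dimension*i+j] += A[dimension*i+k] * B[dimension*k+j];

Each BLOCK x BLOCK tile of A, B and C is reused many times while it is still resident in cache, so far fewer lines have to survive between reuses.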


In addition to the cache line resonance, there are other things going on here. For example, the L1 cache has 16 banks, each 16 bytes wide. With dimension = 2048, successive B accesses in the inner loop will always go to the same bank. So they can't go in parallel - and if the A access happens to go to the same bank, you will lose.

I don't think, looking at it, that this is as big as the cache resonance.

And, yes, possibly, there may be aliasing going on. E.g. the STLF (store-to-load forwarding) buffers may be comparing only a small bitfield of the address, and getting false matches.

(Actually, if you think about it, resonance in the cache is like aliasing, related to the use of bitfields. Resonance is caused by multiple cache lines mapping to the same set instead of being spread around. Aliasing is caused by matching based on incomplete address bits.)


Overall, my recommendation for tuning:

  1. Try cache blocking without any further analysis. I say this because cache blocking is easy, and it is very likely that this is all you would need to do.

  2. After that, use VTune or OProf. Or Cachegrind. Or ...

  3. Better yet, use a well tuned library routine to do matrix multiply.

Duvall answered 28/4, 2012 at 15:26 Comment(3)
Very interesting answer (+1) but terrible formatting and editing :) I did my best to improve it a little.Zucker
Nice. Little typo: 256 cache lines instead of 128.Hatred
Thanks for catching that: 2^8 = 256. I'll try to correct, but I bet I don't catch all of the dependencies. Back when I worked at Intel I wrote a little "Free Text Spreadsheet" that allowed formulae to be placed in the text: type in a new number, and the fix propagated. (I wrote that in undergrad; maybe I can revive it.)Duvall

There are several possible explanations. One likely explanation is what Mysticial suggests: exhaustion of a limited resource (either cache or TLB). Another likely possibility is a false aliasing stall, which can occur when consecutive memory accesses are separated by a multiple of some power-of-two (often 4KB).

You can start to narrow down what's at work by plotting time/dimension^3 for a range of values. If you have blown a cache or exhausted TLB reach, you will see a more-or-less flat section followed by a sharp rise between 2000 and 2048, followed by another flat section. If you are seeing aliasing-related stalls, you will see a more-or-less flat graph with a narrow spike upward at 2048.
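
A sketch of a driver for such a plot, reusing the structure of the program in the question (the dimension range and step are arbitrary, and each multiply takes minutes at these sizes, so expect the whole sweep to take a while):

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

static double timestamp(void)
{
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec/1000000.0;
}

int main(void)
{
        int dim, i, j, k;

        /* print "dimension  seconds/dimension^3" for plotting */
        for(dim = 1984; dim <= 2112; dim += 16)
        {
                double *A = malloc((size_t)dim*dim*sizeof(double));
                double *B = malloc((size_t)dim*dim*sizeof(double));
                double *C = calloc((size_t)dim*dim, sizeof(double));
                double start, end;

                for(i = 0; i < dim*dim; i++) { A[i] = 1.0; B[i] = 1.0; }

                start = timestamp();
                for(i = 0; i < dim; i++)
                        for(j = 0; j < dim; j++)
                                for(k = 0; k < dim; k++)
                                        C[dim*i+j] += A[dim*i+k] * B[dim*k+j];
                end = timestamp();

                printf("%d %.3e\n", dim, (end - start)/((double)dim*dim*dim));

                free(A); free(B); free(C);
        }
        return 0;
}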

Of course, this has diagnostic power, but it is not conclusive. If you want to conclusively know what the source of the slowdown is, you will want to learn about performance counters, which can answer this sort of question definitively.

Mckinley answered 26/10, 2011 at 16:41 Comment(1)
+1, I've never even heard of false-aliasing stalls in this context. But thinking from the hardware design side, it makes sense.Putamen

I know this is waaaay too old, but I'll take a bite. It's (as has been said) a cache issue that causes the slowdown around powers of two. But there is another problem with this code: it's too slow. Look at your compute loop:

for(i = 0; i < dimension; i++)
    for(j = 0; j < dimension; j++)
        for(k = 0; k < dimension; k++)
            C[dimension*i+j] += A[dimension*i+k] * B[dimension*k+j];

The inner-most loop changes k by 1 each iteration, meaning you access A just one double away from the last element you used, but B a whole 'dimension' doubles away from its last element. This doesn't take any advantage of the caching of the elements of B.

If you change this to:

for(i = 0; i < dimension; i++)
    for(j = 0; j < dimension; j++)
        for(k = 0; k < dimension; k++)
            C[dimension*i+k] += A[dimension*i+j] * B[dimension*j+k];

You get the exact same results (modulo double addition associativity errors), but it's a whole lot more cache-friendly (local). I tried it and it gives substantial improvements. This can be summarized as

Don't multiply matrices by definition, but rather, by rows


Example of the speed-up (I changed your code to take the dimension as an argument)

$ diff a.c b.c
42c42
<               C[dimension*i+j] += A[dimension*i+k] * B[dimension*k+j];
---
>               C[dimension*i+k] += A[dimension*i+j] * B[dimension*j+k];
$ make a
cc     a.c   -o a
$ make b
cc     b.c   -o b
$ ./a 1024

secs:88.732918
$ ./b 1024

secs:12.116630

As a bonus (and what makes this relevant to this question), this loop doesn't suffer from the previous problem.

If you already knew all of this, then I apologize!

Indium answered 1/10, 2013 at 12:38 Comment(1)
+1 A better algorithm always makes a bigger difference - regardless of what kind of cache (or even if there is one) this is faster.Laciniate

A couple of answers mentioned L2 Cache problems.

You can actually verify this with a cache simulation. Valgrind's cachegrind tool can do that.

valgrind --tool=cachegrind --cache-sim=yes your_executable

Set the command-line parameters so they match your CPU's L2 parameters.
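
For example, something along these lines (the sizes here are placeholders for a 64 KB 2-way L1D and a 1 MB 16-way L2 with 64-byte lines; check your actual CPU, and note that older Valgrind releases spell the last option --L2 instead of --LL):

valgrind --tool=cachegrind --cache-sim=yes \
         --D1=65536,2,64 --LL=1048576,16,64 your_executable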

Test it with different matrix sizes, you'll probably see a sudden increase in L2 miss ratio.

Bellay answered 26/10, 2011 at 18:10 Comment(0)
