Understanding how to write cache-friendly code
I have been trying to understand how to write cache-friendly code. As a first step, I was trying to understand the performance difference between row-major and column-major array access.

So I created an int array of size 512×512, so that the total size is 1 MB. My L1 cache is 32 KB, my L2 cache is 256 KB, and my L3 cache is 3 MB, so my array fits in the L3 cache.

I simply calculated the sum of the array elements in row-major order and in column-major order and compared their speed. Every time, column-major order was slightly faster. I expected row-major order to be faster (maybe several times faster).

I thought the problem might be due to the small size of the array, so I made another array of size 8192×8192 (256 MB). Still the same result.

Below is the code snippet I used:

#include "time.h"
#include <stdio.h>

#define S 512
#define M S
#define N S

int main() {
    // Summing in the row major order
    int x = 0;
    int iter = 25000;
    int i, j;
    int k[M][N];
    int sum = 0;    
    clock_t start, end;

    start = clock();
    while(x < iter) {
        for (i = 0; i < M; i++) {
            for(j = 0; j < N; j++) {
                sum += k[i][j];
            }
        }

        x++;
    }
    end = clock();
    printf("%i\n", end-start);

    // Summing in the column major order
    x = 0;
    sum = 0;
    int h[M][N];

    start = clock();
    while(x < iter) {
        for (j = 0; j < N; j++) {
            for(i = 0; i < M; i++){
                sum += k[i][j];
            }
        }

        x++;
    }
    end = clock();
    printf("%i\n", end-start);
}

Question: can someone tell me what my mistake is and why I am getting this result?

Driedup answered 30/12, 2013 at 7:38 Comment(4)
Thank you @MOHAMED for your formatting. – Driedup
Can you post your results (the actual count values)? I'm not sure myself, but I've found that people use printf statements to make sure the compiler doesn't optimize out unused parts. Since your "matrices" are uninitialized, summed up, and never used, they may be getting optimized out. – Jugoslavia
Did you try changing the order in which the two cases are computed? – Patronymic
Make sure you compile without any optimization flags. Compilers are smart enough to reverse the loops (make the inner one the outer one). – Redcap

I don't really know why you get this behaviour, but let me clarify some things.

There are at least two things to consider when thinking about the cache: the cache size and the cache line size. For instance, my Intel i7 920 processor has a 256 KB L2 cache with a 64-byte line size. If your data fits inside the cache, then it really doesn't matter in which order you access it.

Optimizing code to be cache friendly targets two things. The first, if possible, is to split the memory accesses into blocks in such a way that each block fits in the cache: do all the computations possible with that block, then bring in the next block, do the computations with it, and so on. The second (the one you are trying to exploit) is to access memory consecutively. When you request one piece of data from memory (say an int, 4 bytes), a whole cache line is brought into the cache (in my case 64 bytes, i.e. 16 adjacent integers, including the one you requested). This is where row order vs. column order comes into play. With row order you have one cache miss for every 16 memory requests; with column order you get a cache miss for every request, but only if your data doesn't fit in the cache. If your data does fit, you get the same ratio as with row order, because the lines are still in the cache from way back when you requested the first element of each line (although associativity can come into play, and a cache line can be evicted even when the cache isn't completely filled with your data).
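To make the blocking idea concrete, here is a minimal sketch of a tiled traversal. The 64×64 tile size is my assumption (such a tile of ints is 16 KB, comfortably inside a 32 KB L1); the function and names are mine, not from the question.

#include <stdio.h>
#include <stdlib.h>

#define SZ 512
#define B  64   /* tile size: a B x B tile of ints is 16 KB */

/* Blocked (tiled) traversal: consume each B x B tile completely
   while it is hot in the cache, then move on to the next tile. */
long blocked_sum(const int *a, int n) {
    long sum = 0;
    for (int bi = 0; bi < n; bi += B)
        for (int bj = 0; bj < n; bj += B)
            for (int i = bi; i < bi + B && i < n; i++)
                for (int j = bj; j < bj + B && j < n; j++)
                    sum += a[i * n + j];
    return sum;
}

int main(void) {
    int *a = calloc(SZ * SZ, sizeof(int));   /* zero-initialized */
    printf("%ld\n", blocked_sum(a, SZ));
    free(a);
}

For a simple sum, blocking buys nothing over plain row order; the pattern pays off when each element is touched more than once, as in matrix multiplication.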

Regarding your problem: when the data fits in the cache, as I said, the access order doesn't matter much; and when you do the second summation, the data is already in the cache from the first one, which is why it is faster. If you do the column-order sum first, you should see the row-order sum become the faster one, simply because it runs second. However, when the data is large enough, you shouldn't get this behaviour. Try the following: between the two sums, work on some other large block of data in order to evict everything else from the cache.
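Something like this between the two timed loops would do it. This is a rough sketch: the 8 MB buffer size is my assumption (anything larger than the 3 MB L3 from the question works), and the volatile keeps the writes from being optimized out.

#define SCRUB_BYTES (8 * 1024 * 1024)

/* Walk a buffer larger than the last-level cache so that the matrix
   lines are evicted before the second measurement begins. */
void evict_cache(void) {
    static volatile char scrub[SCRUB_BYTES];
    for (long i = 0; i < SCRUB_BYTES; i++)
        scrub[i] = (char)i;
}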

Edit

I see a 3-4x speedup for row major (although I expected a >8x speedup; any idea why?). [..] It would be great if you could tell me why the speedup is only 3x.

It's not that accessing the matrix the "right way" doesn't improve much; it's more that accessing the matrix the "wrong way" doesn't hurt that much, if that makes any sense.

Although I can't provide you with a specific and exact answer, what I can tell you is that modern processors have very complicated and extremely efficient cache systems. They are so powerful that, in many common cases, they can mask the cache levels, making it seem as if, instead of a 3-level cache, you have one big single-level cache (you don't see a penalty when growing your data from a size that fits in L2 to a size that fits only in L3). Running your code on an older processor (say, 10 years old), you would probably see the speedup you expect.

Modern desktop processors also have mechanisms that help a lot with cache misses. They are designed with the philosophy of running "bad code" fast, so a lot of investment goes into improving "bad code" performance, because the vast majority of desktop applications aren't written by people who understand branching issues or cache models. This is the opposite of the high-performance market, where specialized processors make bad code hurt very much because they implement only weak mechanisms for dealing with it (or none at all). Those mechanisms take up a lot of transistors, increasing power consumption and heat, but they are worth having in a desktop processor, where most of the code is "bad code".

Redcap answered 30/12, 2013 at 9:8 Comment(2)
+1 – I think you are right about cleaning up the cache between the two operations. It worked, and I see a 3-4x speedup for row major (although I expected a >8x speedup; any idea why?). By the way, I am accepting your answer since it solved my original problem. It would be great if you could tell me why the speedup is only 3x. – Driedup
@AbidRahmanK See the edit. It doesn't give you an exact answer, but it shows in a general way why the speedup is not what you expected. – Redcap

Although this question is 11 years old and already marked as answered, I found the OP's problem super interesting and not fully explained. I've taken a look at the code at different optimization levels and found a few things.

With GCC's -O0 (the default), I get the same results the OP had:

$ gcc -O0 main.c 
$ ./a.out
10964492
9092597

but -Og shows an even bigger difference:

$ gcc -Og main.c
$ ./a.out
3203853
1639167

Despite the code running faster at -Og, the time delta is about the same as at -O0. This most likely indicates a small overhead during the first summation; I suspect it's because the second summation explicitly reuses variables from the first one.

-O1 shows almost no time delta between the summations; however, at -O2 and -O3 almost the whole program gets optimized away, resulting in near-instant execution:

$ gcc -O2 main.c
$ ./a.out
1
1

The disassembly shows that the loops are totally gone: only the calls to printf and clock and their setup remain, nothing else.
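For anyone who wants to verify this themselves, the standard tools suffice; for example:

$ gcc -O2 -S main.c    # writes the generated assembly to main.s
$ objdump -d a.out     # disassembles the compiled binary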


Now, the interesting part: why is this happening, how to prevent it, and how to take an actual measurement and get correct results.

First of all, the OP's code has almost no side effects and reads uninitialized memory, which is undefined behavior. This gives the compiler a perfect opportunity to optimize everything away, even during early stages of compilation.
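For illustration, here is a minimal sketch of how one could keep such loops alive without restructuring the program: initialize the array and route the result through something the compiler must preserve. The volatile sink is my addition, not part of the OP's code; a plain printf of the sum works just as well.

#include <stdio.h>

#define M 512
#define N 512

static int k[M][N];
volatile int sink;   /* writes to a volatile object must be kept */

int main(void) {
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            k[i][j] = i + j;   /* defined values: no more UB */
    int sum = 0;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            sum += k[i][j];
    sink = sum;                /* observable side effect */
    printf("%d\n", sink);
}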

Another thing is that the stride between consecutive accesses in the column-major loop is quite small (one 512-int row, i.e. 2 KB). The OP already mentioned in his question that the whole array may fit into the L3 cache (which is probably true). One cache line is usually 64 bytes [reference], so even when "iterating by the columns" the lines may still be in the cache, unless some other process evicts them.

The other thing is that both buffers are allocated on the stack, and since they are never written, the compiler may even let them share the same memory block.


Considering all of the above, I've tweaked the code a bit. I introduced more side effects by printing the results, which should prevent the aggressive optimizations. The buffers are stored on the heap, and the memory is initialized (though both allocations land at the same address every time, so that address is presumably being reused by the kernel or libc). The strides are also bigger. The only problem is that the summing overflows an int; signed overflow is technically undefined behavior in C, but in practice the value wraps around on x86_64, and the sums come out the same every time.

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define M 12345
#define N 54321

int main(void)
{
    { // Summing in the row major order
        int *k = malloc(sizeof(int) * M * N);
        printf("k: %zu\n", (size_t)k);
        const clock_t start = clock();
        for (int x = 0; x < M * N; ++x)  // initialize: defined values, plus a side effect
            k[x] = x;
        int sum = 0;
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++)
                sum += k[i*N + j]; // <----- pay attention to this line
        printf("sum: %d\n", sum);
        const clock_t end = clock();
        printf("time: %f\n", (float)(end - start) / CLOCKS_PER_SEC);
        free(k);
    }
    { // Summing in the column major order
        int *k = malloc(sizeof(int) * M * N);
        printf("k: %zu\n", (size_t)k);
        const clock_t start = clock();
        for (int x = 0; x < M * N; ++x)
            k[x] = x;
        int sum = 0;
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++)
                sum += k[i + j*M]; // <----- pay attention to this line
        printf("sum: %d\n", sum);
        const clock_t end = clock();
        printf("time: %f\n", (float)(end - start) / CLOCKS_PER_SEC);
        free(k);
    }
}

The results appear correct, and the difference between cache hits and misses is noticeable across optimization levels:

$ gcc -O0 main2.c
$ ./a.out
k: 136416159858704
sum: -188591980
time: 2.688405
k: 136416159858704
sum: -188591980
time: 4.117664
$ gcc -O2 main2.c
$ ./a.out
k: 138602927357968
sum: -188591980
time: 0.626179
k: 138602927357968
sum: -188591980
time: 1.408703

Swapping the order of the two summations also produces the expected results:

$ gcc -O0 main3.c
$ ./a.out
k: 130537320611856
sum: -188591980
time: 4.169673
k: 130537320611856
sum: -188591980
time: 2.678053

However, -O3 again seems to optimize the loops. They're not entirely gone like with the OP's code, but the access pattern has been optimized:

$ gcc -O3 main2.c
$ ./a.out
k: 131184013082640
sum: -188591980
time: 0.366076
k: 131184013082640
sum: -188591980
time: 0.363325

I guess the code is too simple and GCC is just way too smart at -O3 :)


I used GCC version 14.1.1 20240522 with an Intel i5-8600K 4.30 GHz 6c/6t CPU, on Linux kernel 6.9.6.

I hope someone will find this useful someday just like I did.

Cheers!

Weatherwise answered 30/6, 2024 at 0:25 Comment(1)
Wow.. Thanks for looking into this after 11 yrs. I haven't gone through your response yet. A little busy right now. I will definitely go through it soon. – Driedup
