Although, this question is 11y old and already marked as answered, I found the OP's problem super interesting and not explained well. I've taken a look at the code with different optimization levels and found few things.
With GCC's -O0
(which is the default one) I'm getting the same results as OP had:
$ gcc -O0 main.c
$ ./a.out
10964492
9092597
but -0g
shows even bigger difference:
$ gcc -Og main.c
$ ./a.out
3203853
1639167
Despite code running faster in -0g
, the delta time is the same compared to -00
. This most likely indicates there is small overhead during 1st summing. I bet this is because 2nd summing explicitly reuses variables from the 1st one.
-O1
shows almost no delta time between summings, however, in -02
and -03
almost whole code gets optimized away resulting in instant execution:
$ gcc -O2 main.c
$ ./a.out
1
1
The disassembly shows that the loops are totally gone. Only calls to printf
, clock
and their setup are still there, nothing else.
Now, the funny part, why is this happening and how to prevent it, how to make the actual measurement and get the correct results.
First of all, OP's code has almost no side effects and uses uninitialized memory which causes undefined behavior. This creates a perfect opportunity for compiler to optimize everything, even during early stages of compilation.
Another thing is that stride between columns is really small. OP already mentioned in his question that the whole array may fit into L3 cache (which is probably true). One cache line is usually 64 bytes [reference] so even when "iterating by the columns" the lines may still be in cache unless some other process overrides them.
The other thing is that both buffers are allocated on the stack and since they're uninitialized they may even share the same memory block.
Considering what mentioned above I've tweaked the code a bit. I introduced more side effects by printing the results which should prevent aggressive optimizations. The buffers are stored on the heap and memory is initialized (though they seem to be on the same index all the time so this is being optimized either by kernel or libc). Strides are also bigger. The only problem is that summing causes integer overflows but this is well defined on x86_64 (value should wrap around in this case) and sum values are always the same.
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#define M 12345
#define N 54321
int main(void)
{
{ // Summing in the row major order
int *k = malloc(sizeof(int) * M * N);
printf("k: %ld\n", (size_t)k);
const clock_t start = clock();
for (int x = 0; x < M * N; ++x)
k[x] = x;
int sum = 0;
for (int i = 0; i < M; i++)
for(int j = 0; j < N; j++)
sum += k[i*N + j]; // <----- pay attention to this line
printf("sum: %d\n", sum);
const clock_t end = clock();
printf("time: %f\n", (float)(end-start)/(1000000));
free(k);
}
{ // Summing in the column major order
int *k = malloc(sizeof(int) * M * N);
printf("k: %ld\n", (size_t)k);
const clock_t start = clock();
for (int x = 0; x < M * N; ++x)
k[x] = x;
int sum = 0;
for (int i = 0; i < M; i++)
for(int j = 0; j < N; j++)
sum += k[i + j*M]; // <----- pay attention to this line
printf("sum: %d\n", sum);
const clock_t end = clock();
printf("time: %f\n", (float)(end-start)/(1000000));
free(k);
}
}
Results appear correct and the differences between cache hits and misses are noticeable across different optimizations:
$ gcc -O0 main2.c
$ ./a.out
k: 136416159858704
sum: -188591980
time: 2.688405
k: 136416159858704
sum: -188591980
time: 4.117664
$ gcc -O2 main2.c
$ ./a.out
k: 138602927357968
sum: -188591980
time: 0.626179
k: 138602927357968
sum: -188591980
time: 1.408703
Swapping the order of summings also produces expected results:
$ gcc -O0 main3.c
$ ./a.out
k: 130537320611856
sum: -188591980
time: 4.169673
k: 130537320611856
sum: -188591980
time: 2.678053
However, -03
again seems to optimize the loops. They're not entirely gone like with the OP's code but the access pattern has been optimized:
$ gcc -O3 main2.c
$ ./a.out
k: 131184013082640
sum: -188591980
time: 0.366076
k: 131184013082640
sum: -188591980
time: 0.363325
I guess, the code is too simple and GCC is just way too smart in -03
:)
I used GCC version 14.1.1 20240522
with CPU Intel i5-8600K 4.30GHz 6c/6t
on Linux kernel 6.9.6
.
I hope someone will find this useful someday just like I did.
Cheers!
printf
statements to make sure the compiler doesn't optimize out unused parts. Since your "matrices" are uninitialized, summed up and not used they may be getting optimized out? – Jugoslavia