As @dlev mentioned, this is due to locality of reference and has to do with how the physical hardware in the computer works.
Inside the computer, there are many different types of memory. Typically, only certain memory locations (registers) can have actual operations performed on them; the rest of the time, if you're performing operations on data, you have to load it from memory into a register, perform some computation, then write it back.
Main memory (RAM) is much, much slower than registers, often by a factor of a hundred or more. Consequently, reading from memory should be avoided if at all possible. To address this, most computers have special regions of fast memory called caches. The job of the cache is to hold data that has recently been accessed from memory so that if that same memory region is accessed again, the value can be pulled from the cache (fast) rather than from main memory (slow). Typically, caches are designed so that if a value is read in from memory, that value plus a block of adjacent values (a "cache line", commonly 64 bytes) is pulled into the cache. That way, if you iterate over an array, then after reading the first value, the next several values from the array will already be sitting in the cache and can be accessed more efficiently.
The reason that your code is slower than it needs to be is that it doesn't access the array elements sequentially. In C, 2D arrays are laid out in row-major order, meaning that the memory is arranged as
A[0][0] A[0][1] A[0][2] ... A[1][0] A[1][1] A[1][2] ... A[2][0] A[2][1] A[2][2] ...
Consequently, if you use this for loop:
for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        // Do something with A[i][j]
    }
}
Then you get excellent locality, because you will be accessing array elements in the order in which they appear in memory. This makes the number of reads of main memory very small, since everything is typically in cache and ready to go.
However, if you interchange the loops, as you've done, your accesses jump around in memory: each access lands a full row (M elements) away from the previous one instead of on the neighboring element. This means that you will have a lot of cache misses, in which the memory address you read next isn't in the cache. Each miss forces another load from main memory, which can dramatically slow down the program.
Compilers are starting to get smart enough to interchange loops like this automatically, but we're still a ways away from being able to ignore these details. As a general rule, when writing C or C++ code for multidimensional arrays, try to iterate in row-major order rather than column-major order. You can get noticeable speedups in your program.
Hope this helps!
double array[100][100] = { 0 };
) for which the realization in modern compilers probably "outperforms" anything that is said, here. – Wallop