Why is it worse to initialize a two dimensional array like this?
for (int i = 0; i < 100; i++)
    for (int j = 0; j < 100; j++)
        array[j][i] = 0;
        // array[i][j] = 0;

My professor said it was much more costly to initialize a two-dimensional array the first way than the second. Can someone explain what is going on under the hood that makes this the case? Or do the two means of initialization have equal performance?

Pterosaur answered 22/6, 2012 at 0:6 Comment(5)
Locality of reference: you're needlessly thrashing the CPU cache with the "slow" way.Solve
@dlev: why wouldn't you post this as an answer?Tyrrell
because dlev isn't about the rep. dlev is about the loveAstronomer
This helpful document describes memory locality and many other facts in great detail: akkadia.org/drepper/cpumemory.pdfLe
Your professor is probably talking about access patterns for arrays, not about initialization. Initialization (at the point of declaration) has its own syntax (double array[100][100] = { 0 };), whose realization in modern compilers probably "outperforms" anything said here.Wallop

As @dlev mentioned, this is due to locality of reference and has to do with how the physical hardware in the computer works.

Inside the computer, there are many different types of memory. Typically, only certain memory locations (registers) can have actual operations performed on them; the rest of the time, if you're performing operations on data, you have to load it from memory into a register, perform some computation, then write it back.

Main memory (RAM) is much, much slower than registers, often by a factor of hundreds to thousands. Consequently, reading from memory should be avoided if at all possible. To address this, most computers typically have special regions of memory called caches. The job of the cache is to hold data that has recently been accessed from memory such that if that same memory region is accessed again, the value can be pulled from the cache (fast) rather than from main memory (slow). Typically, caches are designed so that if a value is read in from memory, that value, plus a whole bunch of adjacent values, are pulled into the cache. That way, if you iterate over an array, then after reading the first value, the rest of the values from the array will be sitting in the cache and can be accessed more efficiently.
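As an aside, on Linux/glibc you can query the cache-line granularity directly. A minimal sketch (the _SC_LEVEL1_DCACHE_LINESIZE constant is a glibc extension and may be absent on other platforms):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension, not portable C */
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (line > 0)
        printf("L1 data cache line: %ld bytes\n", line);  /* typically 64 */
    return 0;
}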

The reason that your code is slower than it needs to be is that it doesn't access the array elements sequentially. In C, 2D arrays are laid out in row-major order, meaning that the memory is arranged as

A[0][0] A[0][1] A[0][2] ... A[1][0] A[1][1] A[1][2] ... A[2][0] A[2][1] A[2][2] ...
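
To make that layout concrete, here is a minimal sketch (array name and dimensions are arbitrary) printing element addresses: stepping the last index moves one element, while stepping the first index moves a whole row:

#include <stdio.h>

int main(void)
{
    enum { N = 3, M = 4 };
    int A[N][M];

    printf("&A[0][0] = %p\n", (void *)&A[0][0]);
    printf("&A[0][1] = %p (one element, %zu bytes, later)\n",
           (void *)&A[0][1], sizeof(int));
    printf("&A[1][0] = %p (one full row, %zu bytes, later)\n",
           (void *)&A[1][0], M * sizeof(int));
    return 0;
}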

Consequently, if you use this for loop:

for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        // Do something with A[i][j]
    }
}

Then you get excellent locality, because you will be accessing array elements in the order in which they appear in memory. This makes the number of reads of main memory very small, since everything is typically in cache and ready to go.

However, if you interchange the loops, as you've done, your accesses jump around in memory and are not necessarily consecutive. This means that you will have a lot of cache misses in which the memory address you read next isn't in the cache. This increases the number of cache loads, which can dramatically slow down the program.

Compilers are starting to get smart enough to interchange loops like this automatically, but we're still a ways away from being able to ignore these details. As a general rule, when writing C or C++ code for multidimensional arrays, try to iterate in row-major order rather than column-major order. You can get noticeable speedups in your program.
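
If you'd like to see the effect without a profiler, here is a minimal timing sketch of my own (sizes are arbitrary; compile with -O0 so the compiler can't interchange or delete the loops):

#include <stdio.h>
#include <time.h>

#define N 1000
static int A[N][N];            /* static so the 4 MB array isn't on the stack */

int main(void)
{
    clock_t t0 = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = 0;       /* row-major: walks memory sequentially */
    clock_t t1 = clock();

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[j][i] = 0;       /* column-major: strides N elements per access */
    clock_t t2 = clock();

    printf("row-major:    %.4f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("column-major: %.4f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}

On a typical machine the column-major loop runs several times slower.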

Hope this helps!

Ossuary answered 22/6, 2012 at 0:13 Comment(4)
And you expect me to believe this was written in 8 minutes? pfft. (A very nice answer.)Clouet
@pst- I teach a compilers course each summer and was just right now reviewing my slides, so all of this was fresh in my memory. (I just realized that this means I could type it out fast because it was in cache... spooky...)Ossuary
maybe not applicable here, but you could also unroll the loop a bit, or let the compiler do it for you to also get some potential speed up. This could also let the cache and pipeline have some advantages. Of course then you also probably have some intimate knowledge of the hardware on which you are running.Herren
I seem to recall there's a section in "The Mythical Man-Month" (top-notch book. Everyone's read it, right?) where Fred Brooks goes over the work done at IBM regarding locality-of-reference and its bearing on the potential practicality of virtual memory. (And I also recall RADM Grace Murray Hopper relating that when she invented it she called it "using auxiliary storage". When Honeywell developed it they called it "using auxiliary storage". But, she grumbled, it was only widely adopted when IBM took to calling it "virtual memory").Shive

I'll probably get downvoted for this, but if you are programming C, then the "best" is most likely:

memset(array, 0, sizeof(array));

Then you can defer all responsibility for optimizing (which you are obviously worried about) to the implementation of memset. Any hardware-specific advantages can be exploited there.

http://en.wikipedia.org/wiki/Sizeof#Using_sizeof_with_arrays/

http://www.cplusplus.com/reference/clibrary/cstring/memset/
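
One caveat worth adding (my illustration, not from the answer above): sizeof(array) only yields the full array size while array is a genuine array in scope; as a function parameter it decays to a pointer, and sizeof reports only the pointer size:

#include <string.h>

static void clear_wrong(int (*a)[100])
{
    /* BUG sketch: sizeof(a) is the size of a pointer, not 100*100 ints */
    memset(a, 0, sizeof(a));
}

int main(void)
{
    int array[100][100];
    memset(array, 0, sizeof(array)); /* OK: sizeof(array) is 100*100*sizeof(int) here */
    clear_wrong(array);              /* clears only a handful of bytes */
    return 0;
}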

Another observation: if you are initializing to zero, ask yourself why. If your array is static (which, at this size, it probably should be), then the C startup code will zero it for you. Again, this will probably use the most efficient method for your hardware.
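
A minimal sketch of that guarantee (objects with static storage duration are zero-initialized before main begins):

#include <assert.h>

static int array[100][100];    /* static storage: zeroed before main() starts */

int main(void)
{
    assert(array[42][7] == 0); /* guaranteed by the language, no explicit init */
    return 0;
}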

Herren answered 22/6, 2012 at 1:26 Comment(3)
+1 - In C a call to a standard library function is ALWAYS in order.Shive
In C, using the language's own constructs is even better than library functions: there is dedicated syntax for initializing arrays.Wallop
@Josh - The compilers I use all understand that a loop assigning zero to an array is initialization. The resulting code is no different from using memset (which is also "known").Leprosarium

I'm a bit late to the party, and there is an excellent answer already. However, I thought I could contribute by demonstrating how one could answer this question experimentally using a profiling tool (on Linux).

I'll use the perf tool in the Ubuntu 10.10 package linux-tools-common.

Here's the little C program I wrote to answer this question:

// test.c
#define DIM 1024

int main()
{
    int v[DIM][DIM];
    unsigned i, j;

    for (i = 0; i < DIM; i++) {
        for (j = 0; j < DIM; j++) {
#ifdef ROW_MAJOR_ORDER
            v[i][j] = 0;
#else
            v[j][i] = 0;
#endif
        }
    }

    return 0;
}

Then compile the two different versions:

$ gcc test.c -O0 -DROW_MAJOR_ORDER -o row-maj
$ gcc test.c -O0 -o row-min

Note I've disabled optimization with -O0 so gcc has no chance to rearrange our loop to be more efficient.

We can list the performance statistics available with perf by running perf list. In this case we are interested in cache misses, which perf exposes as the event cache-misses.

Now it's as simple as running each version of the program numerous times and taking an average:

$ perf stat -e cache-misses -r 100 ./row-min

 Performance counter stats for './row-min' (100 runs):

             286468  cache-misses               ( +-   0.810% )

        0.016588860  seconds time elapsed   ( +-   0.926% )

$ perf stat -e cache-misses -r 100 ./row-maj

 Performance counter stats for './row-maj' (100 runs):

               9594  cache-misses               ( +-   1.203% )

        0.006791615  seconds time elapsed   ( +-   0.840% )

And now we've experimentally verified that you do in fact see far more cache misses with the "row-minor" version: roughly thirty times as many in this run.

Peignoir answered 10/7, 2012 at 14:38 Comment(0)

If you look at the memory locations accessed by each technique, the second will access consecutive bytes, while the first will hop around in leaps of 100 elements (400 bytes, for 4-byte ints). The memory cache will work much more efficiently if you do it the second way.

Tyrrell answered 22/6, 2012 at 0:13 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.