How can I optimize these loops (with compiler optimization disabled)?

I need to optimize some for loops for speed (for a school assignment) without using compiler optimization flags.

Given a specific Linux server (owned by the school), a satisfactory improvement is to make it run under 7 seconds, and a great improvement is to make it run under 5 seconds. This code that I have right here gets about 5.6 seconds. I am thinking I may need to use pointers with this in some way to get it to go faster, but I'm not really sure. What options do I have?

The file must remain 50 lines or less (not counting comments).

#include <stdio.h>
#include <stdlib.h>

// You are only allowed to make changes to this code as specified by the comments in it.

// The code you submit must have these two values.
#define N_TIMES        600000
#define ARRAY_SIZE     10000

int main(void)
{
    double    *array = calloc(ARRAY_SIZE, sizeof(double));
    double    sum = 0;
    int        i;

    // You can add variables between this comment ...
    register double sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0, sum5 = 0, sum6 = 0, sum7 = 0, sum8 = 0, sum9 = 0;
    register int j;
    // ... and this one.

    printf("CS201 - Asgmt 4 - \n");

    for (i = 0; i < N_TIMES; i++)
    {
        // You can change anything between this comment ...
        for (j = 0; j < ARRAY_SIZE; j += 10)
        {
            sum += array[j];
            sum1 += array[j + 1];
            sum2 += array[j + 2];
            sum3 += array[j + 3];
            sum4 += array[j + 4];
            sum5 += array[j + 5];
            sum6 += array[j + 6];
            sum7 += array[j + 7];
            sum8 += array[j + 8];
            sum9 += array[j + 9];
        }
        // ... and this one. But your inner loop must do the same
        // number of additions as this one does.
    }

    // You can add some final code between this comment ...
    sum += sum1 + sum2 + sum3 + sum4 + sum5 + sum6 + sum7 + sum8 + sum9;
    // ... and this one.

    return 0;
}
Biconcave answered 14/8, 2015 at 1:20 Comment(5)
do you have openMP available on the server? also, why do you have sum+=array[j] in the loop if you have the big sum at the end? ...also... the sum is always 0Waistcoat
Since all the variables and array elements are zero (see calloc), you can replace the entire inner loop (the j one) body with (keeping 19 additions) sum = 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + 0 :-)Wame
We are required to use the array for computing the "sum." Yes it is all 0s but the point is to access the array the thousands of times that are required as fast as possible. and for our linux server, we use a command called time(executable) to determine the time it takes to run. Though you are probably right and I don't need the new sum at the end, I felt it was in the spirit of the problem to do soBiconcave
Better asked, but a duplicate of https://mcmap.net/q/22989/-optimized-sum-of-an-array-of-doubles-in-c-duplicate/224132. If anything, we should close the older question. (after I copy my answer from there to this.)Whimsical
The student in question has probably graduated and moved on, but problems of this type, students of CS are learning how to implement optimizations for a machine. Not how to feed an optimizer (that is a separate course). Tools such as the Compiler Explorer (godbolt.org) and the like are great at learning this sort of thing. The code can be examined and the instructions used by the machine clearly seen. Switch on optimizations to see the compiler at work and compare. It can be tricky to convince the optimizer to emit code for blatantly obvious source as used in this question, though.Triclinic

You may be on the right track, though you'll need to measure it to be certain (my normal advice to measure, not guess seems a little superfluous here since the whole point of the assignment is to measure).

Optimising compilers will probably not see much of a difference since they're pretty clever about that sort of stuff but, since we don't know what optimisation level it will be compiling at, you may get a substantial improvement.

To use pointers in the inner loop is a simple matter of first adding a pointer variable:

register double *pj;

then changing the loop to:

for (pj = &(array[0]); pj < &(array[ARRAY_SIZE]); pj++) {
        sum += *pj++;
        sum1 += *pj++;
        sum2 += *pj++;
        sum3 += *pj++;
        sum4 += *pj++;
        sum5 += *pj++;
        sum6 += *pj++;
        sum7 += *pj++;
        sum8 += *pj++;
        sum9 += *pj;
    }

This keeps the number of additions the same within the loop (assuming you're counting += and ++ as addition operators, of course) but basically uses pointers rather than array indexes.

With no optimisation1 on my system, this drops it from 9.868 seconds (CPU time) to 4.84 seconds. Your mileage may vary.


1 With optimisation level -O3, both are reported as taking 0.001 seconds so, as mentioned, the optimisers are pretty clever. However, given you're seeing 5+ seconds, I'd suggest it wasn't being compiled with optimisation on.

As an aside, this is a good reason why it's usually advisable to write your code in a readable manner and let the compiler take care of getting it running faster. While my meager attempts at optimisation roughly doubled the speed, using -O3 made it run some ten thousand times faster :-)

Wame answered 14/8, 2015 at 1:44 Comment(9)
thank you so very much! I knew pointers were probably the next step for my code but I was implementing them wrong (I was trying to use a similar structure to the arrays, with j + 1, j + 2, etc., but I didn't realize it was just about incrementing one at a time!). You've been a huge help manBiconcave
I would agree with you, but our instructor specifically tells us to never use the compiler's optimization for the class and especially this assignment is about the optimization methods and thus, the compiler's optimization is worthless to me :)Biconcave
Compilers are pretty clever, but they don't have to be that good to throw away computation of results that are never used. Not a very good homework assignment, IMO.Whimsical
Yea this assignment was really iffy :/ Normally the assignments have more meat to them and make more sense practically.Biconcave
@pax: With a non-optimising compiler, keeping the end-pointer in its own variable will make a difference. double *endp = array + ARRAY_SIZE before the loop. Otherwise gcc -O0 may emit code to compute array+ARRAY_SIZE every iteration. Just another example of why this is silly. Oh, and you'll probably also do better with j[0], j[1], ..., and then increment j by 10. Hopefully that will generate asm with [rsi], [rsi + 8], [rsi + 16], instead of loading j, incrementing, and storing for every double.Whimsical
@pax: just saw you said you why you did ++j repeatedly. I assume the assignment was to keep all the FP adds, and reduce all other overhead.Whimsical
This solution is called Duff's Device, IIRC.Triclinic
@casualcoder, not quite. Duff's was a method of combining switch and do..while to unroll a loop that wasn't an exact multiple of the unroll constant. For example, if you unrolled a loop to have eight operations per iteration, but there were 28 operations needed, Duff's would allow you to easily do 8, 8, 8, and 4 operations in the four iterations, without needing any special pre- or post-processing.Wame
@paxdiablo, you are absolutely right, I misremembered. en.wikipedia.org/wiki/Duff's_deviceTriclinic

I am reposting a modified version of my answer from optimized sum of an array of doubles in C, since that question got voted down to -5. The OP of the other question phrased it more as "what else is possible", so I took him at his word and info-dumped about vectorizing and tuning for current CPU hardware. :)

The OP of that question eventually said he wasn't allowed to use compiler options higher than -O0, which I guess is the case here, too.

Summary:

  • Why using -O0 distorts things (unfairly penalizes things that are fine in normal code for a normal compiler). Using -O0 (the gcc/clang default) so your loops don't optimize away is not a valid excuse or a useful way to find out what will be faster with normal optimization enabled. (See also Idiomatic way of performance evaluation? for more about benchmark methods and pitfalls, like ways to enable optimization but still stop the compiler from optimizing away the work you want to measure.)

  • Stuff that's wrong with the assignment.

  • Types of optimizations. FP latency vs. throughput, and dependency chains. Link to Agner Fog's site. (Essential reading for optimization).

  • Experiments getting the compiler to optimize it (after fixing it to not optimize away). Best result with auto-vectorization (no source changes): gcc: half as fast as an optimal vectorized loop. clang: same speed as a hand-vectorized loop.

  • Some more comments on why bigger expressions are a perf win with -O0 only.

  • Source changes to get good performance without -ffast-math, making the code closer to what we want the compiler to do. Also some rules-lawyering ideas that would be useless in the real-world.

  • Vectorizing the loop with GCC architecture-neutral vectors, to see how close the auto-vectorizing compilers came to matching the performance of ideal asm code (since I checked the compiler output).


I think the point of the assignment is to sort of teach assembly-language performance optimizations using C with no compiler optimizations. This is silly. It's mixing up things the compiler will do for you in real life with things that do require source-level changes.

See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?

-O0 doesn't just "not optimize", it makes the compiler store variables to memory after every statement instead of keeping them in registers. It does this so you get the "expected" results if you set a breakpoint with gdb and modify the value (in memory) of a C variable. Or even if you jump to another line in the same function. So each C statement has to be compiled to an independent block of asm that starts and ends with all variables in memory. For a modern portable compiler like gcc which already transforms through multiple internal representations of program flow on the way from source to asm, this part of -O0 requires explicitly de-optimizing its graph of data flow back into separate C statements. These store/reloads lengthen every loop-carried dependency chain so it's horrible for tiny loops if the loop counter is kept in memory. (e.g. 1 cycle per iteration for inc reg vs. 6c for inc [mem], creating a bottleneck on loop counter updates in tight loops).

With gcc -O0, the register keyword lets gcc keep a var in a register instead of memory, and thus can make a big difference in tight loops (Example on the Godbolt Compiler explorer). But that's only with -O0. In real code, register is meaningless: the compiler attempts to optimally use the available registers for variables and temporaries. register is already deprecated in ISO C++11 (but not C11), and there's a proposal to remove it from the language along with other obsolete stuff like trigraphs.

With extra variables involved, -O0 hurts array indexing a bit more than pointer incrementing.

Array indexing usually makes code easier to read. Compilers sometimes fail to optimize stuff like array[i*width + j*width*height], so it's a good idea to change the source to do the strength-reduction optimization of turning the multiplies into += adds.
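As a sketch of that strength reduction (the function names here are made up for illustration, and I'm assuming an access pattern like the array[i*width + j*width*height] above):

```c
#include <stddef.h>

/* Naive: a flat index computed with two multiplies per element. */
double sum_plane_mul(const double *a, size_t width, size_t height, size_t depth)
{
    double sum = 0;
    for (size_t i = 0; i < depth; i++)
        for (size_t j = 0; j < height; j++)
            sum += a[i*width + j*width*height];
    return sum;
}

/* Strength-reduced: the multiplies become += adds of a hoisted stride. */
double sum_plane_add(const double *a, size_t width, size_t height, size_t depth)
{
    double sum = 0;
    size_t stride = width * height;     /* loop-invariant, hoisted */
    size_t row = 0;                     /* tracks i*width without multiplying */
    for (size_t i = 0; i < depth; i++, row += width) {
        size_t idx = row;
        for (size_t j = 0; j < height; j++, idx += stride)
            sum += a[idx];
    }
    return sum;
}
```

An optimizing compiler usually does this transformation itself; doing it in the source mainly matters when (as here) you're stuck at -O0, or when the compiler's induction-variable analysis gives up.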

At an asm level, array indexing vs. pointer incrementing are close to the same performance. (x86 for example has addressing modes like [rsi + rdx*4] which are as fast as [rdi], except on Sandybridge and later.) It's the compiler's job to optimize your code by using pointer incrementing even when the source uses array indexing, when that's faster.

For good performance, you have to be aware of what compilers can and can't do. Some optimizations are "brittle", and a small seemingly-innocent change to the source will stop the compiler from doing an optimization that was essential for some code to run fast. (e.g. pulling a constant computation out of a loop, or proving something about how different branch conditions are related to each other, and simplifying.)


Besides all that, it's a crap sample because it doesn't have anything to stop a smart compiler from optimizing away the entire thing. It doesn't even print the sum. Even gcc -O1 (instead of -O3) threw away some of the looping.

(You can fix this by printing sum at the end. gcc and clang don't seem to realize that calloc returns zeroed memory, and optimize it away to 0.0. See my code below.)

Normally you'd put your code in a function, and call it in a loop from main() in another file. And compile them separately, without whole-program cross-file optimisation, so the compiler can't do optimisations based on the compile-time constants you call it with. The repeat-loop being wrapped so tightly around the actual loop over the array is causing havoc with gcc's optimizer (see below).

Also, the other version of this question had an uninitialized variable kicking around. It looks like long int help was introduced by the OP of that question, not the prof. So I will have to downgrade my "utter nonsense" to merely "silly", because the code doesn't even print the result at the end. That's the most common way of getting the compiler not to optimize everything away in a microbenchmark like this.


I assume your prof mentioned a few things about performance. There are a crapton of different things that could come into play here, many of which I assume didn't get mentioned in a 2nd-year CS class.

Besides multithreading with openmp, there's vectorizing with SIMD. There are also optimizations for modern pipelined CPUs: specifically, avoid having one long dependency chain.

Further essential reading:

Your compiler manual is also essential, esp. for floating point code. Floating point has limited precision, and is not associative. The final sum does depend on which order you do the additions in. Usually the difference in rounding error is small, so the compiler can get a big speedup by re-ordering things if you use -ffast-math to allow it.

Instead of just unrolling, keep multiple accumulators which you only add up at the end, like you're doing with the sum0..sum9 unroll-by-10. FP instructions have medium latency but high throughput, so you need to keep multiple FP operations in flight to keep the floating point execution units saturated.

If you need the result of the last op to be complete before the next one can start, you're limited by latency. For FP add, that's one per 3 cycles. In Intel Sandybridge, IvB, Haswell, and Broadwell, the throughput of FP add is one per cycle. So you need to keep at least 3 independent ops that can be in flight at once to saturate the machine. For Skylake, it's 2 per cycle with latency of 4 clocks. (On the plus side for Skylake, FMA is down to 4 cycle latency.)

In this case, there's also basic stuff like pulling things out of the loop, e.g. help += ARRAY_SIZE.
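For instance, if the only thing a loop body does to help is bump it once per element, the whole loop collapses to one add per outer pass. A scaled-down sketch (the constants are shrunk from the assignment's 600000/10000 so it runs instantly, and the function names are made up):

```c
#define N_TIMES    600     /* scaled down from 600000 */
#define ARRAY_SIZE 100     /* scaled down from 10000 */

/* One add per inner-loop iteration... */
long count_naive(void)
{
    long help = 0;
    for (int i = 0; i < N_TIMES; i++)
        for (int j = 0; j < ARRAY_SIZE; j++)
            help++;
    return help;
}

/* ...is the same as adding the whole trip count once per outer pass. */
long count_hoisted(void)
{
    long help = 0;
    for (int i = 0; i < N_TIMES; i++)
        help += ARRAY_SIZE;
    return help;
}
```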

Compiler Options

Let's start by seeing what the compiler can do for us.

I started out with the original inner loop, with just help += ARRAY_SIZE pulled out, and adding a printf at the end so gcc doesn't optimize everything away. Let's try some compiler options and see what we can achieve with gcc 4.9.2 (on my i5 2500k Sandybridge. 3.8GHz max turbo (slight OC), 3.3GHz sustained (irrelevant for this short benchmark)):

  • gcc -O0 fast-loop-cs201.c -o fl: 16.43s: performance is a total joke. Variables are stored to memory after every operation, and re-loaded before the next. This is a bottleneck, and adds a lot of latency. Not to mention losing out on actual optimisations. Timing / tuning code with -O0 is not useful.

  • -O1: 4.87s

  • -O2: 4.89s

  • -O3: 2.453s (uses SSE to do 2 at once. I'm of course using a 64bit system, so hardware support for -msse2 is baseline.)

  • -O3 -ffast-math -funroll-loops: 2.439s

  • -O3 -march=sandybridge -ffast-math -funroll-loops: 1.275s (uses AVX to do 4 at once.)

  • -Ofast ...: no gain

  • -O3 -ftree-parallelize-loops=4 -march=sandybridge -ffast-math -funroll-loops: 0m2.375s real, 0m8.500s user. Looks like locking overhead killed it. It only spawns the 4 threads total, but the inner loop is too short for it to be a win: it collects the sums every time, instead of giving each thread 1/4 of the outer loop iterations.

  • -Ofast -fprofile-generate -march=sandybridge -ffast-math, run it, then -Ofast -fprofile-use -march=sandybridge -ffast-math: 1.275s. profile-guided optimization is a good idea when you can exercise all the relevant code-paths, so the compiler can make better unrolling / inlining decisions.

  • clang-3.5 -Ofast -march=native -ffast-math: 1.070s. (clang 3.5 is too old to support -march=sandybridge. You should prefer to use a compiler version that's new enough to know about the target architecture you're tuning for, esp. if using -march to make code that doesn't need to run on older architectures.)

gcc -O3 vectorizes in a hilarious way: The inner loop does 2 (or 4) iterations of the outer loop in parallel, by broadcasting one array element to all elements of an xmm (or ymm) register, and doing an addpd on that. So it sees the same values are being added repeatedly, but even -ffast-math doesn't let gcc just turn it into a multiply. Or switch the loops.

clang-3.5 vectorizes a lot better: it vectorizes the inner loop, instead of the outer, so it doesn't need to broadcast. It even uses 4 vector registers as 4 separate accumulators. It knows that calloc only returns 16-byte aligned memory (on x86-64 System V), and when tuning for Sandybridge (before Haswell) it knows that 32-byte loads have a big penalty when misaligned. And that splitting them isn't too expensive since a 32-byte load takes 2 cycles in a load port anyway.

vmovupd -0x60(%rbx,%rcx,8),%xmm4
vinsertf128 $0x1,-0x50(%rbx,%rcx,8),%ymm4,%ymm4

This is worse on later CPUs, especially when the data does happen to be aligned at run-time; see Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd? about GCC versions where -mavx256-split-unaligned-load was on by default with -mtune=generic.

It's actually slower when I tell it that the array is aligned. (with a stupid hack like array = (double*)((ptrdiff_t)array & ~31); which actually generates an instruction to mask off the low 5 bits, because clang-3.5 doesn't support gcc's __builtin_assume_aligned.) In that case it uses a tight loop of 4x vaddpd mem, %ymm, %ymm. It only runs about 0.65 insns per cycle (and 0.93 uops / cycle), according to perf, so the bottleneck isn't front-end.

I checked with a debugger, and calloc is indeed returning a pointer that's an odd multiple of 16. (glibc for large allocations tends to allocate new pages, and put bookkeeping info in the initial bytes, always misaligning to any boundary wider than 16.) So half the 32B memory accesses are crossing a cache line, causing a big slowdown. It is slightly faster to do two separate 16B loads when your pointer is 16B-aligned but not 32B-aligned, on Sandybridge. (gcc enables -mavx256-split-unaligned-load and ...-store for -march=sandybridge, and also for the default tune=generic with -mavx, which is not so good especially for Haswell, or with memory that's usually aligned but the compiler doesn't know about it.)

Source level changes

As we can see from clang beating gcc, multiple accumulators are excellent. The most obvious way to do this would be:

for (j = 0; j < ARRAY_SIZE; j+=4) {  // unroll 4 times
    sum0 += array[j];
    sum1 += array[j+1];
    sum2 += array[j+2];
    sum3 += array[j+3];
}

and then don't collect the 4 accumulators into one until after the end of the outer loop.

Your (from the other question) source change of

sum += j[0]+j[1]+j[2]+j[3]+j[4]+j[5]+j[6]+j[7]+j[8]+j[9];

actually has a similar effect, thanks to out-of-order execution. Each group of 10 is a separate dependency chain. Order-of-operations rules say the j values get added together first, and then added to sum. So the loop-carried dependency chain is still only the latency of one FP add, and there's lots of independent work for each group of 10. Each group is a separate dependency chain of 9 adds, and takes few enough instructions for the out-of-order execution hardware to see the start of the next chain and find the parallelism to keep those medium-latency, high-throughput FP execution units fed.

With -O0, as your silly assignment apparently requires, values are stored to RAM at the end of every statement. Writing longer expressions without updating any variables, even temporaries, will make -O0 run faster, but it's not a useful optimisation. Don't waste your time on changes that only help with -O0, esp. not at the expense of readability.


Using 4 accumulator variables and not adding them together until the end of the outer loop defeats clang's auto-vectorizer. It still runs in only 1.66s (vs. 4.89 for gcc's non-vectorized -O2 with one accumulator). Even gcc -O2 without -ffast-math also gets 1.66s for this source change. Note that ARRAY_SIZE is known to be a multiple of 4, so I didn't include any cleanup code to handle the last up-to-3 elements (or to avoid reading past the end of the array, which would happen as written now). It's really easy to get something wrong and read past the end of the array when doing this.

GCC, on the other hand, does vectorize this, but it also pessimises (un-optimises) the inner loop into a single dependency chain. I think it's doing multiple iterations of the outer loop, again.


Using gcc's platform-independent vector extensions, I wrote a version which compiles into apparently-optimal code:

// compile with gcc -g -Wall -std=gnu11 -Ofast -fno-tree-vectorize -march=native fast-loop-cs201.vec.c -o fl3-vec

#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <assert.h>
#include <string.h>

// You are only allowed to make changes to this code as specified by the comments in it.

// The code you submit must have these two values.
#define N_TIMES     600000
#define ARRAY_SIZE   10000

int main(void)
{
    double  *array = calloc(ARRAY_SIZE, sizeof(double));
    double  sum = 0;
    int     i;

    // You can add variables between this comment ...
    long int help = 0;

    typedef double v4df __attribute__ ((vector_size (8*4)));
    v4df sum0={0}, sum1={0}, sum2={0}, sum3={0};

    const size_t array_bytes = ARRAY_SIZE*sizeof(double);
    double *aligned_array = NULL;

    // this more-than-declaration could go in an if(i == 0) block for strict compliance with the rules
    if ( posix_memalign((void**)&aligned_array, 32, array_bytes) ) {
        exit (1);
    }
    memcpy(aligned_array, array, array_bytes);  // In this one case: faster to align once and have no extra overhead for N_TIMES through the loop

    // ... and this one.

    // Please change 'your name' to your actual name.
    printf("CS201 - Asgmt 4 - I. Forgot\n");

    for (i = 0; i < N_TIMES; i++) {

        // You can change anything between this comment ...
    /*
    #if defined(__GNUC__) && (__GNUC__ * 100 + __GNUC_MINOR__) >= 407 // GCC 4.7 or later.
        array = __builtin_assume_aligned(array, 32);
    #else
        // force-align for other compilers.  This loop-invariant will be done outside the loop.
        array = (double*) ((ptrdiff_t)array & ~31);
    #endif
    */

        assert ( ARRAY_SIZE / (4*4) == (ARRAY_SIZE+15) / (4*4) );  // We don't have a cleanup loop to handle where the array size isn't a multiple of 16

        // incrementing pointers can be more efficient than indexing arrays
        // esp. on recent Intel where micro-fusion only works with one-register addressing modes
        // of course, the compiler can always generate pointer-incrementing asm from array-indexing source
        const double *start = aligned_array;

        while ( (ptrdiff_t)start & 31 ) {
            // annoying loops like this are the reason people use aligned buffers
            sum += *start++;        // scalar until we reach 32B alignment
            // in practice, this loop doesn't run, because we copy into an aligned buffer
            // This will also require a cleanup loop, and break our multiple-of-16 doubles assumption.
        }

        const v4df *end = (v4df *)(aligned_array+ARRAY_SIZE);
        for (const v4df *p = (v4df *)start ; p+3 < end; p+=4) {
            sum0 += p[0];      // p+=4 increments the pointer by 4 * 4 * 8 bytes
            sum1 += p[1];       // make sure you keep track of what you're incrementing
            sum2 += p[2];
            sum3 += p[3];
        }

        // the compiler might be smart enough to pull this out of the inner loop
        // in fact, gcc turns this into a 64bit movabs outside of both loops :P
        help+= ARRAY_SIZE;

            // ... and this one. But your inner loop must do the same
            // number of additions as this one does.

        /* You could argue legalese and say that
         if (i == 0) {
             for (j ...)
                 sum += array[j];
             sum *= N_TIMES;
         }
         * still does as many adds in its *INNER LOOP*, but it just doesn't run it as often
         */
    }

    // You can add some final code between this comment ...
    sum0 = (sum0 + sum1) + (sum2 + sum3);
    sum += sum0[0] + sum0[1] + sum0[2] + sum0[3];
    printf("sum = %g; help=%ld\n", sum, help);  // defeat the compiler.

    free (aligned_array);
    free (array);  // not strictly necessary, because this is the end of main().  Leaving it out for this special case is a bad example for a CS class, though.
    // ... and this one.

    return 0;
}

The inner loop compiles to:

4007c0:  c5 e5 58 19     vaddpd (%rcx),%ymm3,%ymm3
4007c4:  48 83 e9 80     sub    $0xffffffffffffff80,%rcx # subtract -128, because
                                        # -128 fits in imm8 instead of requiring
                                        # an imm32 to encode add $128, %rcx
4007c8:  c5 f5 58 49 a0  vaddpd -0x60(%rcx),%ymm1,%ymm1  # one-register addressing
                                                         # mode can micro-fuse
4007cd:  c5 ed 58 51 c0  vaddpd -0x40(%rcx),%ymm2,%ymm2
4007d2:  c5 fd 58 41 e0  vaddpd -0x20(%rcx),%ymm0,%ymm0
4007d7:  4c 39 c1        cmp    %r8,%rcx                 # compare end with p
4007da:  75 e4           jne    4007c0 <main+0xb0>

(For more, see online compiler output at the godbolt compiler explorer. The -xc compiler option compiles as C, not C++. The inner loop is from .L3 to jne .L3. See the tag wiki for x86 asm links. See also this q&a about micro-fusion not happening on SnB-family, which Agner Fog's guides don't cover).

Performance on Sandybridge:

perf stat -e task-clock,cycles,instructions,r1b1,r10e,stalled-cycles-frontend,stalled-cycles-backend,L1-dcache-load-misses,cache-misses ./fl3-vec

Output:

CS201 - Asgmt 4 - I. Forgot
sum = 0; help=6000000000

Performance counter stats for './fl3-vec':

  1086.571078  task-clock (msec)       #    1.000 CPUs utilized
4,072,679,849  cycles                  #    3.748 GHz
2,629,419,883  instructions            #    0.65  insns per cycle
                                       #    1.27  stalled cycles per insn
4,028,715,968  r1b1                    # 3707.733 M/sec  # unfused uops
2,257,875,023  r10e                    # 2077.982 M/sec  # fused uops. Lower than insns because of macro-fusion
3,328,275,626  stalled-cycles-frontend #   81.72% frontend cycles idle
1,648,011,059  stalled-cycles-backend  #   40.47% backend  cycles idle
  751,736,741  L1-dcache-load-misses   #  691.843 M/sec
       18,772  cache-misses            #    0.017 M/sec

  1.086925466 seconds time elapsed

(With more modern perf, I'd have used uops_issued.any (fused-domain) and uops_executed.thread (unfused domain) instead of r10e and r1b1, respectively. Use perf list to see available events with descriptions on your CPU.)

The low instructions per cycle is a bottleneck on L2 cache bandwidth. The inner loop is using 4 separate accumulators, and I checked with gdb that the pointers are aligned. So cache-bank conflicts aren't the problem. Sandybridge L2 cache can transfer 32B in a cycle, which could keep up with the one 32B FP vector add per cycle. But L2 bandwidth can't sustain the peak 1 transfer per clock on Intel SnB / Haswell / Skylake CPUs. There aren't enough line-fill buffers to keep enough misses in flight to sustain the peak throughput every cycle, or some other limiter.

32B loads from L1 take 2 cycles (it wasn't until Haswell that Intel made 32B loads a single-cycle operation). However, there are 2 load ports, so the sustained throughput is 32B per cycle (which we're not reaching).

The perf counters indicate a fairly high L1 cache hit rate, so hardware prefetch from L2 to L1 seems to be doing its job.

0.65 instructions per cycle is only about half way to saturating the vector FP adder. IACA says the loop would run in 4 cycles per iteration, if the loads all hit in L1d cache. i.e. saturate the load ports and port1 (where the FP adder lives).

See also Single Threaded Memory Bandwidth on Sandy Bridge (an Intel forum thread with much discussion about what limits throughput, and how latency * max_concurrency is one possible bottleneck). See also the "Latency Bound Platforms" part of the answer to Enhanced REP MOVSB for memcpy: limited memory concurrency is a bottleneck for loads as well as stores, but for loads, prefetch into L2 does mean you might not be limited purely by Line Fill buffers for outstanding L1D misses.

Reducing ARRAY_SIZE to 1008 (multiple of 16), and increasing N_TIMES by a factor of 10, brought the runtime down to 0.5s. That's 1.68 insns per cycle. (The inner loop is 7 total instructions for 4 FP adds, thus we are finally saturating the vector FP add unit, and the load ports.) Loop tiling is a much better solution, see below.

Intel CPUs only have 32k each L1-data and L1-instruction caches. I think your array would just barely fit in the 64kiB L1D on an AMD K10 (Istanbul) CPU, but not Bulldozer-family (16kiB L1D) or Ryzen (32kiB L1D).

Gcc's attempt to vectorize by broadcasting the same value into a parallel add doesn't seem so crazy. If it had managed to get this right (using multiple accumulators to hide latency), that would have allowed it to saturate the vector FP adder with only half the memory bandwidth. As-is, it was pretty much a wash, probably because of overhead in broadcasting.

Also, it's pretty silly. N_TIMES is just a make-work repeat count. We don't actually want to optimize for doing the identical work multiple times, unless we want to win at silly assignments like this. A source-level way to do this would be to increment i in the part of the code we're allowed to modify:

for (...) {
    sum += a[j] + a[j] + a[j] + a[j];
}
i += 3;  // The inner loop does 4 total iterations of the outer loop

More realistically, to deal with this you could interchange your loops (loop over the array once, adding each value N_TIMES times). I think I've read that Intel's compiler will sometimes do that for you.
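A sketch of that interchange (hypothetical names, scaled-down constants; for arbitrary FP data the rounding can differ between the two, though for this assignment's all-zero calloc'd array the result is exactly the same):

```c
#define N_TIMES    600     /* scaled down from 600000 */
#define ARRAY_SIZE 100     /* scaled down from 10000 */

/* Original order: sweep the whole array N_TIMES times. */
double sum_repeated(const double *array)
{
    double sum = 0;
    for (int i = 0; i < N_TIMES; i++)
        for (int j = 0; j < ARRAY_SIZE; j++)
            sum += array[j];
    return sum;
}

/* Interchanged and reduced: touch each element once, scaling by the
   repeat count.  Not bit-exact for arbitrary FP inputs, but identical
   for integer-valued data well inside double precision. */
double sum_interchanged(const double *array)
{
    double sum = 0;
    for (int j = 0; j < ARRAY_SIZE; j++)
        sum += array[j] * N_TIMES;
    return sum;
}
```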


A more general technique is called cache blocking, or loop tiling. The idea is to work on your input data in small blocks that fit in cache. Depending on your algorithm, it can be possible to do various stages of the computation on a chunk, then repeat for the next chunk, instead of having each stage loop over the whole input. As always, once you know the right name for a trick (and that it exists at all), you can google up a ton of info.
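For this particular repeat-loop, tiling would look something like the sketch below (illustrative names; BLOCK would really be sized so a tile fits in L1d, and the constants are scaled down so it runs instantly):

```c
#define N_TIMES    600     /* scaled down from 600000 */
#define ARRAY_SIZE 100     /* scaled down from 10000 */
#define BLOCK       16     /* elements per tile */

/* Do all N_TIMES passes over one cache-sized block before moving on,
   so each block is pulled in from memory once instead of N_TIMES times. */
double sum_tiled(const double *array)
{
    double sum = 0;
    for (int base = 0; base < ARRAY_SIZE; base += BLOCK) {
        int end = base + BLOCK < ARRAY_SIZE ? base + BLOCK : ARRAY_SIZE;
        for (int i = 0; i < N_TIMES; i++)
            for (int j = base; j < end; j++)
                sum += array[j];
    }
    return sum;
}
```

The total number of additions is unchanged; only the order of memory accesses is, which is what makes it defensible under the assignment's rules.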

You could rules-lawyer your way into putting an interchanged loop inside an if (i == 0) block in the part of the code you're allowed to modify. It would still do the same number of additions, but in a more cache-optimal order.

Whimsical answered 14/8, 2015 at 2:0 Comment(12)
Thanks for the info! I'll definitely check out your stuff that you posted there but I don't want to use vectors and such as we have never covered such a thing in class, let alone even talk about it. I did hit the target time using variable splitting (the sums), unrolling the loop (doing several entries each j loop), and using pointers to traverse the array. I will definitely look over and save the info you have provided! ThanksBiconcave
@BlackDahlia1147: With simple loops, the trick is to let the compiler use vectors for you. (That's what auto-vectorization means.) Good compilers will already increment pointers, instead of indexing arrays, when appropriate. (Unless you use -O0...). -O0 stores to memory after every statement, so doing one big set of adds in a single statement is a win with -O0, but not otherwise. Otherwise, just the required order of operations matters for dependency chains / throughput vs. latency.Whimsical
I'm working on a re-edit of that answer for this question. The -O0 requirement was a late addition to the first version. It's still pretty silly, IMO, compared to just programming in ASM if you want to see the diff between pointer increments and array indices. (Since C compilers are free to do that transformation themselves!)Whimsical
@BlackDahlia1147: ok, updated my answer a bit for this question. I reworded some of the ranting about how weird it is to optimize with -O0, with some detailed explanation of why it's going to make you waste time on source changes that aren't needed with an optimizing compiler.Whimsical
For Intel: x86 is highly OoO, so you can actually issue a read for the next cache line before you start computing on the current one. The read that misses cache will stall, but the program can continue, so the fetch of the next line overlaps with the CPU's computation on the current one. This is a technique called "touching the data".Hoffarth
@user3528438: If you actually load data an iteration or more ahead, that's called software pipelining. (And you need to duplicate the loop body, minus the loads, for a final iteration). If you do it just as a prefetch, you should use the prefetch instructions instead. A normal load won't be able to retire until it completes, which could lead to the ROB filling up (because of in-order retirement, needed for precise exceptions). AFAIK, SW prefetching in sequential accesses is a waste of time on Core2 and later, because HW prefetching notices the pattern.Whimsical
@BlackDahlia1147: I'm really glad someone took the time to read all that stuff I took the time to write. Cheers :)Whimsical
I see you already discovered that Clang unrolls to four independent accumulators. I like the term accumulator. I should have used that.Motorway
Interesting result with -fprofile-generate. I have never used this. It seems it can be quite powerful. How often do you use this? Is this used much by others?Motorway
@Zboson: I use it sometimes. I'd recommend it whenever you can easily make test inputs that exercise all the important code-paths. You could get worse results if gcc thinks something is cold because your test inputs didn't trigger it. Of course, avoid exercising the code that should be cold. x264's Makefile has a make fprofiled build-target, which automates the whole process. It needs a sample video to encode, and encodes it with several different settings. (Multiple profiling runs accumulate, so you can exercise different options in different code-paths.)Whimsical
@PeterCordes re"x86 for example has addressing modes like [rsi + rdx*4] which are as fast as [rdi]. except on Sandybridge and later" This is pretty big except! But I think icelake and newer this is no longer the case.Ropeway
@Noah: Ice Lake made all store-address ports equal so there isn't that no-p7 downside, but still un-laminates indexed addr modes in the same cases as HSW/SKL. Micro fusion and addressing modes. At least the instruction I checked, vpaddd (uops.info/html-tp/ICL/VPADDD_XMM_XMM_M128-Measurements.html) shows 2 retire-slots (fused-domain uops) with vpaddd xmm0,xmm1, [r14+r13*1] vs. one with [r14]. It can't stay micro-fused because it's not 2-operand with a RMW destination. (BMI like blsi r,m are all 2-uop on ICL even non-indexed, weird)Whimsical
W
4

You may be on the right track, though you'll need to measure it to be certain (my normal advice to "measure, don't guess" seems a little superfluous here, since the whole point of the assignment is to measure).

Optimising compilers will probably not see much of a difference since they're pretty clever about that sort of stuff but, since we don't know what optimisation level it will be compiling at, you may get a substantial improvement.

To use pointers in the inner loop is a simple matter of first adding a pointer variable:

register double *pj;

then changing the loop to:

for (pj = &(array[0]); pj < &(array[ARRAY_SIZE]); pj++) {
        sum += *pj++;
        sum1 += *pj++;
        sum2 += *pj++;
        sum3 += *pj++;
        sum4 += *pj++;
        sum5 += *pj++;
        sum6 += *pj++;
        sum7 += *pj++;
        sum8 += *pj++;
        sum9 += *pj;
    }

This keeps the number of additions the same within the loop (assuming you're counting += and ++ as addition operators, of course) but basically uses pointers rather than array indexes.
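A variant worth trying at -O0 (the function name and parameters here are mine, for illustration): hoist the end pointer into its own variable so an unoptimised build doesn't recompute array + ARRAY_SIZE on every iteration, and use constant offsets from one pointer that is bumped once per ten elements.

```c
#include <stddef.h>

/* Pointer traversal with a hoisted end pointer, illustrative sketch.
 * Constant offsets pj[0]..pj[9] with a single pointer bump per group
 * avoid ten separate increment-and-store round trips at -O0. */
double pointer_sum(const double *array, size_t n)  /* n: multiple of 10 */
{
    const double *pj = array;
    const double *endp = array + n;  /* computed once, before the loop */
    double sum = 0.0;
    while (pj < endp) {
        sum += pj[0] + pj[1] + pj[2] + pj[3] + pj[4]
             + pj[5] + pj[6] + pj[7] + pj[8] + pj[9];
        pj += 10;
    }
    return sum;
}
```

Note that this folds the ten accumulators into one big expression, which changes the dependency chain; at -O0, where every statement stores its result to memory, one big statement can still be a win.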

With no optimisation1 on my system, this drops it from 9.868 seconds (CPU time) to 4.84 seconds. Your mileage may vary.


1 With optimisation level -O3, both are reported as taking 0.001 seconds so, as mentioned, the optimisers are pretty clever. However, given you're seeing 5+ seconds, I'd suggest it wasn't compiled with optimisation on.

As an aside, this is a good reason why it's usually advisable to write your code in a readable manner and let the compiler take care of getting it running faster. While my meager attempts at optimisation roughly doubled the speed, using -O3 made it run some ten thousand times faster :-)

Wame answered 14/8, 2015 at 1:44 Comment(9)
thank you so very much! I knew pointers were probably the next step for my code but I was implementing them wrong (was trying to use a similar structure to the arrays with j + 1, j + 2, etc. but I didn't realize it was just about incrementing one at a time! You've been a huge help manBiconcave
I would agree with you, but our instructor specifically tells us to never use the compiler's optimization for the class and especially this assignment is about the optimization methods and thus, the compiler's optimization is worthless to me :)Biconcave
Compilers are pretty clever, but they don't have to be that good to throw away computation of results that are never used. Not a very good homework assignment, IMO.Whimsical
Yea this assignment was really iffy :/ Normally the assignments have more meat to them and make more sense practically.Biconcave
@pax: With a non-optimising compiler, keeping the end-pointer in its own variable will make a difference. double *endp = array + ARRAY_SIZE before the loop. Otherwise gcc -O0 may emit code to compute array+ARRAY_SIZE every iteration. Just another example of why this is silly. Oh, and you'll probably also do better with j[0], j[1], ..., and then increment j by 10. Hopefully that will generate asm with [rsi], [rsi + 8], [rsi + 16], instead of loading j, incrementing, and storing for every double.Whimsical
@pax: just saw where you said why you did ++j repeatedly. I assume the assignment was to keep all the FP adds, and reduce all other overhead.Whimsical
This solution is called Duff's Device, IIRC.Triclinic
@casualcoder, not quite. Duff's was a method of combining switch and do..while to unroll a loop that wasn't an exact multiple of the unroll constant. For example, if you unrolled a loop to have eight operations per iteration, but there were 28 operations needed, Duff's would allow you to easily do 8, 8, 8, and 4 operations in the four iterations, without needing any special pre- or post-processing.Wame
@paxdiablo, you are absolutely right, I misremembered. en.wikipedia.org/wiki/Duff's_deviceTriclinic
J
0

Before anything else, try to change compiler settings to produce faster code. There is general optimisation, and the compiler might do auto vectorisation.

You should always try several approaches and measure which is fastest. As a target, try to get to one cycle per addition or better.

Number of iterations per loop: You add up 10 sums simultaneously. It might be that your processor doesn't have enough registers for that, or it might have some to spare. I'd measure the time for 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... sums per loop.

Number of sums: Having more than one sum means that latency doesn't bite you, just throughput. But more than four or six might not be helpful. Try four sums, with 4, 8, 12, 16 iterations per loop. Or six sums, with 6, 12, 18 iterations.
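For example, four accumulators with eight elements per pass might look like this (a hedged sketch; the function name and unroll factor are my own choices):

```c
#include <stddef.h>

/* Four independent accumulators, eight elements per pass, illustrative
 * sketch: independent sums hide FP-add latency; combine once at the end. */
double sum4acc(const double *a, size_t n)  /* n: multiple of 8 */
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t j = 0; j < n; j += 8) {
        s0 += a[j]     + a[j + 4];
        s1 += a[j + 1] + a[j + 5];
        s2 += a[j + 2] + a[j + 6];
        s3 += a[j + 3] + a[j + 7];
    }
    return (s0 + s1) + (s2 + s3);
}
```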

Caching: You are running through an array of 80,000 bytes, which is probably larger than your L1 cache. Split the array into 2 or 4 parts. Do an outer loop iterating over the subarrays, a middle loop from 0 to N_TIMES - 1, and an inner loop adding up the values.

And then you can try using vector operations, or multi-threading your code, or using the GPU to do the work.

And if you are forced to use no optimisation, then the "register" keyword might actually work.

Jural answered 4/11, 2016 at 15:54 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.