C for loop indexing: is forward-indexing faster in new CPUs?
G

7

25

On a mailing list I'm subscribed to, two fairly knowledgeable (IMO) programmers were discussing some optimized code, and saying something along the lines of:

On the CPUs released 5-8 years ago, it was slightly faster to iterate for loops backwards (e.g. for (int i=x-1; i>=0; i--) {...}) because comparing i to zero is more efficient than comparing it to some other number. But with very recent CPUs (e.g. from 2008-2009) the speculative loader logic is such that it works better if the for loop is iterated forward (e.g. for (int i=0; i< x; i++) {...}).

My question is, is that true? Have CPU implementations changed recently such that forward-loop-iterating now has an advantage over backward-iterating? If so, what is the explanation for that? i.e. what changed?

(Yes, I know, premature optimization is the root of all evil, review my algorithm before worrying about micro-optimizations, etc etc... mostly I'm just curious)

Grease answered 23/12, 2009 at 5:42 Comment(1)
What is a "speculative loader"? Google returns a handful of hits with this article at the top. I'm guessing it's one of those CPU technologies that does parts of future instructions in advance?Busey
T
35

You're really asking about prefetching, not about loop control logic.

In general, loop performance isn't going to be dictated by the control logic (i.e. the increment/decrement and the condition that gets checked every time through). The time it takes to do these things is inconsequential except in very tight loops. If you're interested in that, take a look at John Knoeller's answer for specifics on the 8086's counter register and why it might've been true in the old days that counting down was more efficient. As John says, branch prediction (and also speculation) can play a role in performance here, as can instruction prefetching.

Iteration order can affect performance significantly when it changes the order in which your loop touches memory. The order in which you request memory addresses can affect what is drawn into your cache and also what is evicted from your cache when there is no longer room to fetch new cache lines. Having to go to memory more often than needed is much more expensive than compares, increments, or decrements. On modern CPUs it can take thousands of cycles to get from the processor to memory, and your processor may have to idle for some or all of that time.
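As a rough illustration (a sketch with arbitrary names and sizes): C lays out 2D arrays row-major, so these two functions touch exactly the same data with the same number of compares and increments, but in very different memory orders:

#include <stddef.h>

#define ROWS 1024
#define COLS 1024

static int grid[ROWS][COLS];

/* Row-major order: consecutive iterations touch consecutive addresses,
   so every byte of each fetched cache line gets used. */
long sum_rows_first(void)
{
    long total = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            total += grid[r][c];
    return total;
}

/* Column-major order over the same data: consecutive iterations jump
   COLS * sizeof(int) bytes apart, so nearly every access lands in a
   different cache line, and lines get evicted before they're reused. */
long sum_cols_first(void)
{
    long total = 0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            total += grid[r][c];
    return total;
}

The loop control is identical in both; only the order of the memory requests differs, and that's where the performance difference comes from.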

You're probably familiar with caches, so I won't go into all those details here. What you may not know is that modern processors employ a whole slew of prefetchers to try to predict what data you're going to need next at different levels of the memory hierarchy. Once they predict, they try to pull that data from memory or lower level caches so that you have what you need when you get around to processing it. Depending on how well they grab what you need next, your performance may or may not improve when using them.

Take a look at Intel's guide to optimizing for hardware prefetchers. There are four prefetchers listed; two for NetBurst chips:

  1. NetBurst's hardware prefetcher can detect streams of memory accesses in either forward or backward directions, and it will try to load data from those locations into the L2 cache.
  2. NetBurst also has an adjacent cache line (ACL) prefetcher, which will automatically load two adjacent cache lines when you fetch the first one.

and two for Core:

  1. Core has a slightly more sophisticated hardware prefetcher; it can detect strided access in addition to streams of contiguous references, so it'll do better if you step through an array every other element, every 4th, etc.
  2. Core also has an ACL prefetcher like NetBurst.

If you're iterating through an array forward, you're going to generate a bunch of sequential, usually contiguous memory references. The ACL prefetchers are going to do much better for forward loops (because you'll end up using those subsequent cache lines) than for backward loops, but you may do ok making memory references backward if the prefetchers can detect this (as with the hardware prefetchers). The hardware prefetchers on the Core can detect strides, which is helpful for more sophisticated array traversals.
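Roughly, these are the access patterns being talked about (a sketch; the function names are just for illustration):

long forward_sum(const int *a, int n)
{
    long sum = 0;
    /* Forward stream: contiguous ascending addresses, the easiest
       pattern for any of the prefetchers listed above. */
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

long backward_sum(const int *a, int n)
{
    long sum = 0;
    /* Backward stream: descending addresses, a direction the NetBurst
       and Core stream prefetchers can also detect. */
    for (int i = n - 1; i >= 0; i--)
        sum += a[i];
    return sum;
}

long strided_sum(const int *a, int n)
{
    long sum = 0;
    /* Stride of 4 elements: a pattern Core's prefetcher can track but
       a streams-only prefetcher cannot. */
    for (int i = 0; i < n; i += 4)
        sum += a[i];
    return sum;
}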

These simple heuristics can get you into trouble in some cases. For example, Intel actually recommends that you turn off adjacent cache line prefetching for servers, because they tend to make more random memory references than desktop user machines. The probability of not using an adjacent cache line is higher on a server, so fetching data you're not actually going to use ends up polluting your cache (filling it with unwanted data), and performance suffers. For more on addressing this kind of problem, take a look at this paper from Supercomputing 2009 on using machine learning to tune prefetchers in large data centers. Some guys at Google are on that paper; performance is something that is of great concern to them.

Simple heuristics aren't going to help you with more sophisticated algorithms, and you might have to start thinking about the sizes of your L1, L2, etc. caches. Image processing, for example, often requires that you perform some operation on subsections of a 2D image, but the order you traverse the image can affect how well useful pieces of it stay in your cache without being evicted. Take a look at Z-order traversals and loop tiling if you're interested in this sort of thing. It's a pretty basic example of mapping the 2D locality of image data to the 1D locality of memory to improve performance. It's also an area where compilers aren't always able to restructure your code in the best way, but manually restructuring your C code can improve cache performance drastically.
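For a flavor of loop tiling, here's a minimal sketch (the sizes are arbitrary, and real code would tune the block size to the target cache):

#define N 1024
#define BLOCK 64   /* tile size; must divide N in this simple sketch */

/* Naive transpose: src is read along rows, but dst is written along
   columns, so the writes stride across cache lines. */
void transpose_naive(double dst[N][N], const double src[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            dst[j][i] = src[i][j];
}

/* Tiled transpose: process the array in BLOCK x BLOCK chunks so the
   cache lines touched in both src and dst stay resident until the
   whole tile is done. */
void transpose_tiled(double dst[N][N], const double src[N][N])
{
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int i = ii; i < ii + BLOCK; i++)
                for (int j = jj; j < jj + BLOCK; j++)
                    dst[j][i] = src[i][j];
}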

I hope this gives you an idea of how iteration order affects memory performance. It does depend on the particular architecture, but the ideas are general. You should be able to understand prefetching on AMD and Power if you can understand it on Intel, and you don't really have to know assembly to structure your code to take advantage of memory. You just need to know a little computer architecture.

Trevar answered 23/12, 2009 at 7:51 Comment(3)
The adjacent cache-line spatial prefetcher isn't biased towards forward. It tries to fill in the other half of the 128B-aligned pair of cache lines, whether that's forwards or backwards. From my reading of Intel's description of the Sandybridge-family prefetchers in their optimization manual, there doesn't seem to be any prefetch-based reason to prefer forward vs. backward streams, since it can track an equal number of each kind of stream. However, iterating backward can defeat auto-vectorization, or make gcc do it very badly.Etti
I've been meaning to try this idea sometime: iterate forward in one loop, iterate backward in the next loop over the same array. Hopefully this gives as much reuse of cached data as possible before we get to addresses that have already been evicted. I think looping over an array even just slightly too large for the cache will normally miss almost every time, since the line we need next is always the oldest, and cache replacement policy heuristics are more or less LRU.Etti
BTW, John Knoeller's answer is wrong: you can still save an insn by looping towards zero (either up from negative numbers or down from positive numbers) on most architectures, not just x86. In some tiny loops, it can be the difference between issuing at one iteration per 1 clock or one iteration per 2 clocks for 4 vs. 5 uops (this is why unrolling is good). However, compilers are either bad at this (gcc), or optimize non-array up-counts to down-counts (clang). Flip the compiler to gcc on that godbolt link to see how gcc fails to save an insn when counting down.Etti
B
14

I don't know. But I do know how to write a quick benchmark with no guarantees of scientific validity (actually, one with rather strict guarantees of invalidity). It has interesting results:

#include <time.h>
#include <stdio.h>

int main(void)
{
    int i;
    int s;
    clock_t start_time, end_time;
    long centiseconds;

    start_time = clock();
    s = 1;
    for (i = 0; i < 1000000000; i++)
    {
        s = s + i;
    }
    end_time = clock();
    centiseconds = (end_time - start_time)*100 / CLOCKS_PER_SEC;
    printf("Answer is %d; Forward took %ld centiseconds\n", s, centiseconds);

    start_time = clock();
    s = 1;
    for (i = 999999999; i >= 0; i--)
    {
        s = s + i;
    }
    end_time = clock();
    centiseconds = (end_time - start_time)*100 / CLOCKS_PER_SEC;
    printf("Answer is %d; Backward took %ld centiseconds\n", s, centiseconds);

    return 0;
}

Compiled with -O9 using gcc 3.4.4 on Cygwin, running on an "AMD Athlon(tm) 64 Processor 3500+" (2211 MHz) in 32 bit Windows XP:

Answer is -1243309311; Forward took 93 centiseconds
Answer is -1243309311; Backward took 92 centiseconds

(Answers varied by 1 either way in several repetitions.)

Compiled with -O9 using gcc 4.4.1 running on an "Intel(R) Atom(TM) CPU N270 @ 1.60GHz" (800 MHz and presumably only one core, given the program) in 32 bit Ubuntu Linux.

Answer is -1243309311; Forward took 196 centiseconds
Answer is -1243309311; Backward took 228 centiseconds

(Answers varied by 1 either way in several repetitions.)

Looking at the code, the forward loop is translated to:

; Gcc 3.4.4 on Cygwin for Athlon      ; Gcc 4.4.1 on Ubuntu for Atom
L5:                                .L2:
    addl    %eax, %ebx                 addl    %eax, %ebx
    incl    %eax                       addl    $1, %eax
    cmpl    $999999999, %eax           cmpl    $1000000000, %eax
    jle     L5                         jne     .L2

The backward to:

L9:                                .L3:
    addl    %eax, %ebx                 addl    %eax, %ebx
    decl    %eax                       subl    $1, %eax
    jns     L9                         cmpl    $-1, %eax
                                       jne     .L3

Which shows, if not much else, that GCC's behaviour has changed between those two versions!

Pasting the older GCC's loops into the newer GCC's asm file gives results of:

Answer is -1243309311; Forward took 194 centiseconds
Answer is -1243309311; Backward took 133 centiseconds

Summary: on the >5 year old Athlon, the loops generated by GCC 3.4.4 are the same speed. On the newish (<1 year?) Atom, the backward loop (GCC 3.4.4's version) is significantly faster. GCC 4.4.1 has a slight regression for this particular case which I personally am not bothered about in the least, given the point of it. (I had to make sure that s is used after the loop, because otherwise the compiler would elide the computation altogether.)

[1] I can never remember the command for system info...

Busey answered 23/12, 2009 at 6:16 Comment(4)
You can get a decent amount of CPU info with cat /proc/cpuinfoTrevar
@tgamblin: thanks! I thought there was something in /bin too, but this one is enough. It even works in Cygwin which is a pleasant surprise.Busey
Try just running pure repetition; does the compiler optimize it to a simple loop instruction?Tierell
@Electro: if you don't use the control variable (e.g. print it out or something), some compilers will just remove empty loops altogether.Trevar
H
8

Yes, but with a caveat. The idea that looping backwards is faster never applied to all older CPUs. It's an x86 thing (as in 8086 through 486, possibly Pentium, although I don't think any further).

That optimization never applied to any other CPU architecture that I know of.

Here's why.

The 8086 had a register that was specifically optimized for use as a loop counter. You put your loop count in CX, and then there are several instructions that decrement CX and then set condition codes if it goes to zero. In fact there was an instruction prefix you could put before other instructions (the REP prefix) that would basically iterate the other instruction until CX got to 0.

Back in the days when we counted instructions, and instructions had known, fixed cycle counts, using CX as your loop counter was the way to go, and CX was optimized for counting down.
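A rough sketch of what that looked like (hand-written for illustration, not real compiler output):

/* A count-down loop of the kind that mapped straight onto CX.  An
   8086 compiler or assembly programmer could express the whole loop
   as roughly:
       mov  cx, n       ; loop count lives in CX
   top: ...             ; loop body
       loop top         ; decrement CX, branch back while CX != 0
   (hand-written illustration, not actual compiler output) */
void do_n_times(unsigned n)
{
    unsigned i;
    for (i = n; i != 0; i--) {
        /* loop body */
    }
}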

But that was a long time ago. Ever since the Pentium, those complex instructions have been slower overall than using more, simpler instructions. (RISC, baby!) The key thing we try to do these days is to put some time between loading a register and using it, because the pipelines can actually do multiple things per cycle as long as you don't try to use the same register for more than one thing at a time.
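One way to see the idea (a sketch of mine, not something from the original discussion): splitting a reduction across independent accumulators gives the pipeline several separate dependency chains to overlap, instead of one long serial chain through a single register.

long sum_overlapped(const int *a, int n)
{
    /* Four independent accumulators: no add has to wait on the
       register written by the add immediately before it. */
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)  /* leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}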

Nowadays the thing that kills performance isn't the comparison, it's the branching, and then only when the branch prediction predicts wrong.

Hemato answered 23/12, 2009 at 6:22 Comment(2)
Even if you don't use the loop instruction, it still saves an insn to count downwards. Counting upwards, you need to compare against an end-point. Downwards, you only need to compare against zero, and you can do that without an extra compare insn in most architectures, including RISC. e.g. in ARM, subs r2, r2, #1 does r2 -= 1, setting flags so you can branch on r2 being 0 or not. (The s suffix means "set flags"). On RISC machines without a flag register, you'd just branch on r2 being non-zero instead of running a compare insn to produce a 0 or non-zero in another reg.Etti
Also, this description of looping on 8086 with CX is actually wrong. The dec insn to decrement and set flags works on any register. The loop insn implicitly uses CX, but doesn't set flags (it's a slow decrement-and-branch insn). rep string store/copy/scan instructions can scan forwards or backwards in memory according to the count in CX, and the setting of the Direction Flag. Total insn count can still matter, but it's often not the bottleneck.Etti
A
7

I stumbled upon this question after observing a significant drop in performance when iterating over an array backwards vs forwards. I was afraid it would be the prefetcher, but the previous answers convinced me this was not the case. I then investigated further and found out that it looks like GCC (4.8.4) is unable to exploit the full power of SIMD operations in a backward loop.

In fact, compiling the following code (from here) with -S -O3 -mavx:

  for (i = 0; i < N; ++i)
    r[i] = (a[i] + b[i]) * c[i];

leads to essentially:

.L10:
    addl    $1, %edx
    vmovupd (%rdi,%rax), %xmm1
    vinsertf128     $0x1, 16(%rdi,%rax), %ymm1, %ymm1
    vmovupd (%rsi,%rax), %xmm0
    vinsertf128     $0x1, 16(%rsi,%rax), %ymm0, %ymm0
    vaddpd  (%r9,%rax), %ymm1, %ymm1
    vmulpd  %ymm0, %ymm1, %ymm0
    vmovupd %xmm0, (%rcx,%rax)
    vextractf128    $0x1, %ymm0, 16(%rcx,%rax)
    addq    $32, %rax
    cmpl    %r8d, %edx
    jb      .L10

i.e. assembly code that uses the AVX extensions to perform four double operations in parallel (for example, vaddpd and vmulpd).

Conversely, the following code compiled with the same parameters:

  for (i = 0; i < N; ++i)
    r[N-1-i] = (a[N-1-i] + b[N-1-i]) * c[N-1-i];

produces:

.L5:
    vmovsd  a+79992(%rax), %xmm0
    subq    $8, %rax
    vaddsd  b+80000(%rax), %xmm0, %xmm0
    vmulsd  c+80000(%rax), %xmm0, %xmm0
    vmovsd  %xmm0, r+80000(%rax)
    cmpq    $-80000, %rax
    jne     .L5

which only performs one double operation at a time (vaddsd, vmulsd).

This fact alone may account for a factor of 4 in performance between iterating backward vs. forward.

Using -ftree-vectorizer-verbose=2, it looks like the problem is storing backwards: "negative step for store". In fact, if a, b, and c are read backwards but r is written in the forward direction, the code is vectorized again.
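A sketch of that rewrite (note it stores the results in reversed order relative to the loops above, so it only demonstrates the store-direction issue, not a drop-in replacement):

  /* Loads run backwards but the store runs forwards; GCC no longer
     reports "negative step for store" and vectorizes this again. */
  for (i = 0; i < N; ++i)
    r[i] = (a[N-1-i] + b[N-1-i]) * c[N-1-i];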

Ahner answered 21/4, 2016 at 16:23 Comment(0)
E
2

It probably doesn't make a hoot of difference speed-wise, but I often write:

for (i = n; --i >= 0; ) blah blah

which I think at one time generated cleaner assembly.

Of course, in answering this kind of question, I run the risk of affirming that this is important. It's a micro-optimization kind of question, which is closely related to premature optimization, which everybody says you shouldn't do, but nevertheless SO is awash in it.

Engineering answered 23/12, 2009 at 14:38 Comment(0)
S
1

When optimizing loops I'd rather look into loop unrolling, as it cuts down the number of comparisons against the exit value, and it may be optimized for parallel processing (e.g. MMX), depending on what goes on inside the loop.
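A minimal sketch of what that means (assuming, for brevity, a count that is a multiple of 4):

/* Unrolled by 4: one compare-and-branch per four elements instead of
   one per element.  Assumes n is a multiple of 4 to keep the sketch
   short; real code needs a cleanup loop for the remainder. */
void scale(float *a, float k, int n)
{
    for (int i = 0; i < n; i += 4) {
        a[i]     *= k;
        a[i + 1] *= k;
        a[i + 2] *= k;
        a[i + 3] *= k;
    }
}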

Screwworm answered 23/12, 2009 at 5:48 Comment(0)
C
0

No, we can't say that CPU implementations have changed to make forward looping faster. And that has very little to do with the CPUs themselves.

It has to do with the fact that you haven't specified which CPU you're talking about, nor which compiler.

You cannot ask a blanket question about CPU issues with the C tag and expect to get an intelligent answer, simply because nothing in the C standard mandates how fast CPUs should be at various operations.

If you'd like to rephrase your question to target a specific CPU and machine language (since what machine language you get out of a C compiler depends entirely on the compiler), you may get a better answer.

In either case, it shouldn't matter. You should be relying on the fact that the people who wrote your compiler know a great deal more than you about how to eke the last bit of performance out of the various CPUs.

The direction in which you should be iterating has always been dictated by what you have to do. For example, if you have to process array elements in ascending order, you use:

for (i = 0; i < 1000; i++) { process (a[i]); }

rather than:

for (i = 999; i >= 0; i--) { process (a[999-i]); }

simply because any advantage you may gain in going backwards is more than swamped by the extra calculations on i. It may well be that a naked loop (no work done in the body) may be faster in one direction than another but, if you have such a naked loop, it's not doing any real work anyway.

As an aside, it may well be that both those loops above will come down to the same machine code anyway. I've seen some of the code put out by the GCC optimizer and it made my head spin. Compiler writers are, in my opinion, a species alone when it comes to insane levels of optimization.

My advice: always program for readability first then target any specific performance problems you have ("get it working first, then get it working fast").

Chopin answered 23/12, 2009 at 5:46 Comment(10)
It's really annoying that nobody ever answers performance questions on here. People ask a question that might have an interesting answer, then all the parrots come out and say "get it working first, then get it working fast". Yes. That's a great rule of thumb, but what if someone (god forbid) actually got to the "get it working fast" part? They'll never find an answer on SO.Trevar
To add to that, there are PLENTY of loops in high performance computing where traversal order matters a lot. It's not just forward or back, but what memory you touch in the loop and how it hits your cache and how well the prefetcher can predict it that will determine performance. Look at z-order traversals, which were MADE to preserve locality in situations like this.Trevar
@tgamblin, there is no answer based on the available data simply because there are holes in the question. You don't know what the compiler will output, you don't know what the CPU is, and so on. And I've answered plenty of questions from people at the "get it working fast" stage here on SO. The answer there is to profile and target problem areas (all of which depends on the target environments).Chopin
... If you want performance, you have two ways to go. Code in assembly for your target environment or trust in the compiler writers. I prefer the latter. In any case, you're free to answer the question differently if you're not happy with my answer.Chopin
One can safely assume that he wants to know about the most common consumer CPUs, which are Intel's and AMD's at the moment.Enrich
You may assume that; I prefer not to. And which compiler? VC, GCC, Code::Blocks? Which version of said compiler? My point is there are too many unknowns in the question to make a blanket statement. And even if all that were known, I still stand by my comment that the compiler writers will leave regular folk eating their dust in terms of optimization.Chopin
I was hoping to learn something interesting about recent trends in speculative execution technology, not to get a lecture about the evils of premature optimization. That's why I included the final paragraph in my question -- to head off exactly this sort of unhelpful response.Grease
Then you should not have asked about C code. There is a disconnect between C and the CPU which you haven't made a link for. You would have been better off asking about assembly and specifying the CPUs you're interested in.Chopin
@pax: That is a whole lot of BS. You don't need to talk about specific assembly to talk about performance. I work in high performance computing and the vast majority of optimizations people make to C code (and C++, and Fortran) don't involve touching assembly at all. They involve restructuring code (especially loops) to perform better with different memory systems, compilers, processors, etc. You CAN give general recommendations about performance, and you can talk about how C code will perform on different machines in general without mentioning specific ones.Trevar
Well, then we're going to have to agree to disagree on that one. @JeremyF, if you actually think this answer is unhelpful, let me know and I'll delete it. I'm trying to help but if it's unappreciated, I'd rather move on.Chopin
