Is it faster to count down than it is to count up?

Asked 12/5, 2010 at 21:56 Answered 9/5, 2017 at 18:22

139

Our computer science teacher once said that for some reason it is faster to count down than to count up. For example if you need to use a FOR loop and the loop index is not used somewhere (like printing a line of N * to the screen)

I mean that code like this:

for (i = N; i >= 0; i--)  
  putchar('*');

is faster than:

for (i = 0; i < N; i++)  
  putchar('*');

Is it really true? And if so, does anyone know why?

Wommera answered 12/5, 2010 at 21:56 Comment(19)

Which computer scientist? In what publication? – Negligible 12/5, 2010 at 21:57

duplicate of #1657006 ? – Clipping 12/5, 2010 at 21:58

I am not sure if all processors, but some have a zero register, which is maybe sliiiiightly faster to compare with ;) – Espousal 12/5, 2010 at 22:4

Not quite a duplicate. The other is regarding Java, this is tagged as C. About 10 years ago this would have made a slightly noticable difference to some Java applications. Unless this is your bottleneck, it's not worth your time. – Myrta 12/5, 2010 at 22:4

A key part of this question is the "loop index is not used somewhere" phrase. The comparison operation may be slightly faster, but odds are in most loops, one is sequentially operating on memory. When counting forward, you are likely to have the entire block of memory you are working on through all loops loaded at once, causing only one page fault (which costs the same amount of time as dozens if not hundreds of instructions). If you're counting backwards and this causes you to walk backwards through a continuous chunk of memory, you are likely to greatly increase the number of page faults. – Dieldrin 12/5, 2010 at 22:40

It's conceivable that you could save a nanosecond per iteration, or about as much as a one hair on a family of woolly mammoths. The putchar is using 99.9999% of the time (give or take). – Nolde 12/5, 2010 at 22:44

Premature optimization is the root of all evil. Use whichever form seems right to you, because (as you already know) they're logically equivalent. The hardest part of programming is communicating the theory of the program to other programmers (and yourself!). Using a construct that makes you or some other programmer ever look at it for more than a second is a net loss. You will never recoup the time anyone spends thinking "why does this count down?" – Synsepalous 12/5, 2010 at 23:7

@David M: agree. For better or worse, humans count up, which means our natural loops do so also. Down-counting loops are fine, and there are good algorithmic reasons in some cases to have them, but they're usually because of the nature of the data. Having all loops count down gratuitously makes you wonder why every time you read one. – Thrum 13/5, 2010 at 0:54

+1 for "it hardly makes a difference" and for "Premature optimization is the root of all evil". If anyone really wants to split hairs then I would say that it depends on the compiler, to some extent, but ultimately it depends on the processor running the code. Neither of which are specified here. Mawg's law says "in general, it is better to avoid sweeping statements". – Limbic 13/5, 2010 at 2:2

Well, it shouldn't make much of a difference these days except perhaps if the program is interpreted or compiled to some form of code that isn't native machine code (e.g. bytecode, P-code, etc.) rather than compiled to native machine code. I do recall something similar being said of the ++ operators in PHP at one point though... $i++ took longer than ++$i for some reason. – Flipper 13/5, 2010 at 2:48

facepalm (yes, even if true in some sense, you don't confuse young padawans like that) – Toreutics 18/5, 2010 at 7:46

The first loop is obviously slower, since it calls putchar 11 times, whereas the second one only calls it 10 times. – Vanpelt 4/6, 2010 at 22:13

Did you notice that if i is unsigned, the first loop is an infinite one? – Ramp 16/4, 2012 at 14:36

Here is some fuel for the fire: i++ ==> temp = i;i=i+1; return temp; ++i ==> return i=i+1; – Cornia 11/11, 2012 at 14:56

If N evolves some day to calculateN() then the decrement variant would be a great advangage ;) – Recollected 10/5, 2013 at 8:43

Most people who disagree the need of decrement loop will be computer programmers and in embedded micro-controller programming may be even a microsecond will count. In case of an interrupt service routine you may need to exit as fast as possible. Also with each iteration you may be saving only 1 microsecond but consider the case if the loop is executed 1000 times then you may be saving a millisecond which is a lot of time in micro-controller level highly time critical applications. So what your computer teacher said is true but its not needed in computer programming. – Dumfries 29/10, 2014 at 9:7

The AVR manuals actually recommend counting down because comparing to zero can be faster than comparing to an immediate value. – Meilhac 16/2, 2018 at 23:15

Is comparing to zero faster than comparing to any other number? – Lucarne 14/8, 2018 at 1:1

developer.arm.com/documentation/dui0472/i/… – Yellowwood 27/2, 2023 at 6:19

386

Is it really true? and if so does anyone know why?

In ancient days, when computers were still chipped out of fused silica by hand, when 8-bit microcontrollers roamed the Earth, and when your teacher was young (or your teacher's teacher was young), there was a common machine instruction called decrement and skip if zero (DSZ). Hotshot assembly programmers used this instruction to implement loops. Later machines got fancier instructions, but there were still quite a few processors on which it was cheaper to compare something with zero than to compare with anything else. (It's true even on some modern RISC machines, like PPC or SPARC, which reserve a whole register to be always zero.)

So, if you rig your loops to compare with zero instead of N, what might happen?

You might save a register
You might get a compare instruction with a smaller binary encoding
If a previous instruction happens to set a flag (likely only on x86 family machines), you might not even need an explicit compare instruction

Are these differences likely to result in any measurable improvement on real programs on a modern out-of-order processor? Highly unlikely. In fact, I'd be impressed if you could show a measurable improvement even on a microbenchmark.

Summary: I smack your teacher upside the head! You shouldn't be learning obsolete pseudo-facts about how to organize loops. You should be learning that the most important thing about loops is to be sure that they terminate, produce correct answers, and are easy to read. I wish your teacher would focus on the important stuff and not mythology.

Crewel answered 12/5, 2010 at 22:36 Comment(12)

++ And besides, the putchar takes many orders of magnitude longer than the loop overhead. – Nolde 12/5, 2010 at 22:39

It's not strictly mythology: if he is doing some sort of uber-optimized real-time system, it would come in handy. But that sort of hacker would probably know all this already and certainly wouldn't be confusing entry-level CS students with arcana. – Ballew 12/5, 2010 at 22:50

+1 Also, if this were profitable, it's likely the compiler will optimize it anyway, so the programmer is free to write the loop for readability. – Ninetieth 13/5, 2010 at 0:53

@Jay, sorry, it wouldn't. The optimization tends to change program behavior detectably and the compiler can't perform the proof of correctness that this optimization is safe. – Ozan 13/5, 2010 at 1:38

@Joshua: In what way would this optimization be detectable? As the questioner said, the loop index isn't used in the loop itself, so provided the number of iterations is the same there is no change in behavior. In terms of a proof of correctness, making the variable substitution j=N-i shows that the two loops are equivalent. – Trapper 13/5, 2010 at 10:3

@Trapper most of the time where this optimization is useful it's detectable. You're right in this case it's not detectable. – Ozan 13/5, 2010 at 15:10

+1 for the Summary. Don't sweat it because on modern hardware it makes virtually no difference. It made virtually no difference 20 years ago either. If you think you have to care, time it both ways, see no clear difference, and go back to writing the code clearly and correctly. – Handful 15/5, 2010 at 18:4

Segway... until a year or so ago, this was still important in JavaScript. Reverse While loops were significantly faster than incremental loops. Though JS engines have since been re-optimized, you never know when it might be undone in the future. So it's nice to know, but certainly not in a C class, nor an intro-level. – Lillian 12/2, 2013 at 5:26

I don't know if I should upvote for the body or downvote for the summary. – Recollected 10/5, 2013 at 8:47

Well, not good to say it's make no difference, in certain usage it's can be very important. I talking about low power embedded solution, like when your sensor work on energy harvesting, or you need to work on battery for 10 years and you wake-up for some millisecond every hours... But clearly a bad thing to say to student... Make your code clear and easy to read is lot more important... And on "normal" computer it's change nothing, it's even in numberous case counterproductive because compiler optimization or cache system doesn't work backward. – Sanitation 12/7, 2015 at 17:18

Note about the PPC processor: The Branch and Decrement if Not Zero instruction (bdnz) was the only loop instruction within the loop. Since it also executed on the branch unit with a single cycle latency, it was always executed parallel to some other instruction within the loop. So, basically, bdnz based loop control was for free. This was a nice micro-optimization compared to the fully data dependent increment-compare-branch upward counting which would have used the integer unit for two cycles, and wasted two registers. – Armagh 28/4, 2017 at 21:14

ARM64 has single instructions cbz/cbnz to branch if a register is (non)zero. Comparing with any other value requires two instructions cmp / b.cc and moreover trashes the flags register. – Percentile 13/11, 2023 at 0:32

Here's what might happen on some hardware depending on what the compiler can deduce about the range of the numbers you're using: with the incrementing loop you have to test i<N each time round the loop. For the decrementing version, the carry flag (set as a side effect of the subtraction) may automatically tell you if i>=0. That saves a test per time round the loop.

In reality, on modern pipelined processor hardware, this stuff is almost certainly irrelevant as there isn't a simple 1-1 mapping from instructions to clock cycles. (Though I could imagine it coming up if you were doing things like generating precisely timed video signals from a microcontroller. But then you'd write in assembly language anyway.)

Seto answered 12/5, 2010 at 22:1 Comment(4)

wouldn't that be the zero flag and not the carry flag? – Wommera 12/5, 2010 at 22:23

@Wommera In this case you might want to reach zero, print a result, decrement further, and then find you've gone one below zero causing a carry (or borrow). But written slightly differently a decrementing loop might use the zero flag instead. – Seto 12/5, 2010 at 22:34

Just to be perfectly pedantic, not all modern hardware is pipelined. Embedded processors will have much more relevance to this sort of microoptimization. – Ballew 12/5, 2010 at 22:47

@Paul As I have some experience with Atmel AVRs I didn't forget to mention microcontrollers... – Seto 12/5, 2010 at 22:55

In the Intel x86 instruction set, building a loop to count down to zero can usually be done with fewer instructions than a loop that counts up to a non-zero exit condition. Specifically, the ECX register is traditionally used as a loop counter in x86 asm, and the Intel instruction set has a special jcxz jump instruction that tests the ECX register for zero and jumps based on the result of the test.

However, the performance difference will be negligible unless your loop is already very sensitive to clock cycle counts. Counting down to zero might shave 4 or 5 clock cycles off each iteration of the loop compared to counting up, so it's really more of a novelty than a useful technique.

Also, a good optimizing compiler these days should be able to convert your count up loop source code into count down to zero machine code (depending on how you use the loop index variable) so there really isn't any reason to write your loops in strange ways just to squeeze a cycle or two here and there.

Fullbodied answered 12/5, 2010 at 22:18 Comment(7)

I've seen Microsoft's C++ compiler from a few years back make that optimization. It's able to see that the loop index isn't used, so it rearranges it to the fastest form. – Skulduggery 12/5, 2010 at 22:47

@Mark: The Delphi compiler as well, starting in 1996. – Fullbodied 12/5, 2010 at 22:59

@MarkRansom Actually, the compiler may be able to implement the loop using count down even if the loop index variable is used, depending on how it is used in the loop. If the loop index variable is used only to index into static arrays (arrays of known size at compile time), the array indexing can be done as ptr + array size - loop index var, which can still be a single instruction in x86. It's pretty wild to be debugging assembler and see the loop counting down but the array indices going up! – Fullbodied 23/1, 2012 at 18:58

@Fullbodied yeah, that was great fun seeing that happen in the CPU view of the Delphi debugger :) – Dulin 15/2, 2012 at 15:0

Actually today your compiler probably won't use the loop and jecxz instructions as they're slower than a dec / jnz pair. – Choroid 15/7, 2013 at 12:8

@FUZxxl All the more reason not to write your loop in strange ways. Write human readable clear code and let the compiler do its job. – Fullbodied 15/7, 2013 at 17:18

@dthorpe, Does today's Intel still loop faster backwards? Or are they equal now? – Anatomical 12/8, 2017 at 9:31

Yes..!!

Counting from N down to 0 is slightly faster that Counting from 0 to N in the sense of how hardware will handle comparison..

Note the comparison in each loop

i>=0
i<N

Most processors have comparison with zero instruction..so the first one will be translated to machine code as:

Load i
Compare and jump if Less than or Equal zero

But the second one needs to load N form Memory each time

load i
load N
Sub i and N
Compare and jump if Less than or Equal zero

So it is not because of counting down or up.. But because of how your code will be translated into machine code..

So counting from 10 to 100 is the same as counting form 100 to 10
But counting from i=100 to 0 is faster than from i=0 to 100 - in most cases
And counting from i=N to 0 is faster than from i=0 to N

Note that nowadays compilers may do this optimization for you (if it is smart enough)
Note also that pipeline can cause Belady's anomaly-like effect (can not be sure what will be better)
At last: please note that the 2 for loops you have presented are not equivalent.. the first prints one more * ....

Related: Why does n++ execute faster than n=n+1?

Collin answered 12/5, 2010 at 22:4 Comment(9)

noted that the loops are not the same, forgot to put the "=" sign in one of them – Wommera 12/5, 2010 at 22:13

so what you're saying is it's not faster to count down, it's just faster to compare to zero than any other value. Meaning counting from 10 to 100 and counting down from a 100 to 10 would be the same? – Wommera 12/5, 2010 at 22:19

Yes.. it is not the matter of "counting down or up".. but it is the matter of "comparing to what".. – Collin 12/5, 2010 at 22:21

so I guess the same holds for counting from -10 to 0 being "faster" (on some ancient system) then counting from 0 to -10, right? – Cusped 13/5, 2010 at 5:34

While this is true the assembler level. Two things combine to meke untrue in reality -- modern hardware using long pipes and speculative instructions will sneak in the "Sub i and N" without incurring an extra cycle -- and -- even the crudest compiler will optimise the the "Sub i and N" out of existence. – Audreyaudri 13/5, 2010 at 5:47

@Cusped Doesn't have to be an ancient system. It just has to be an instruction set where there is a compare to zero operation which is in some way faster/better than the equivalent compare to register value. x86 has it in jcxz. x64 still has it. Not ancient. Also, RISC architectures often special-case zero. The DEC AXP Alpha chip (in the MIPS family), for example, had a "zero register" - read as zero, write does nothing. Comparing against the zero register instead of against a general register that contains a zero value reduces inter instruction dependencies and helps out of order execution. – Fullbodied 23/1, 2012 at 19:7

@Betamoo: I am often wondering why not better / more correct answers (which is yours) are not more appreciated by more votes and come to conclusion that too often on stackoverflow votes are influenced by reputation (in points) of a person that answers (which is very very bad) and not by the answer correctness – Covered 8/4, 2013 at 10:4

@Artur: Because this answer isn't better. The supposed "compare-to-zero" instruction either doesn't exist or, in the case of jcxz that dthorpe mentioned, is slower that the alternative. The advantage is real, not because of some supposed special "compare-to-zero" instruction, but because the compare-to-zero flags are generally set for free as a side effect of another instruction. But even worse than that is the absurd claim in this answer that "the second one needs to load N form [sic] Memory each time". A compiler that can't figure out how to enregister a constant is absolute garbage. – Hamitic 11/4, 2015 at 4:29

@Collin your comment pointing out that this happens because "comparison with 0 is faster than comparison with a different number" is the most important piece of information on this whole question. – Yellowwood 27/2, 2023 at 6:17

In C to psudo-assembly:

for (i = 0; i < 10; i++) {
    foo(i);
}

turns into

    clear i
top_of_loop:
    call foo
    increment i
    compare 10, i
    jump_less top_of_loop

while:

for (i = 10; i >= 0; i--) {
    foo(i);
}

turns into

    load i, 10
top_of_loop:
    call foo
    decrement i
    jump_not_neg top_of_loop

Note the lack of the compare in the second psudo-assembly. On many architectures there are flags that are set by arithmatic operations (add, subtract, multiply, divide, increment, decrement) which you can use for jumps. These often give you what is essentially a comparison of the result of the operation with 0 for free. In fact on many architectures

x = x - 0

is semantically the same as

compare x, 0

Also, the compare against a 10 in my example could result in worse code. 10 may have to live in a register, so if they are in short supply that costs and may result in extra code to move things around or reload the 10 every time through the loop.

Compilers can sometimes rearrange the code to take advantage of this, but it is often difficult because they are often unable to be sure that reversing the direction through the loop is semantically equivalent.

Supply answered 12/5, 2010 at 23:12 Comment(2)

Is it possible that there is a diff of 2 instructions instead of only 1? – Anatomical 12/8, 2017 at 9:36

Also, why is it hard to be sure of that? As long as the var i is not used within the loop, obviously you can flip it over isn't it? – Anatomical 12/8, 2017 at 9:38

Count down faster in case like this:

for (i = someObject.getAllObjects.size(); i >= 0; i--) {…}

because someObject.getAllObjects.size() executes once at the beginning.

Sure, similar behaviour can be achieved by calling size() out of the loop, as Peter mentioned:

size = someObject.getAllObjects.size();
for (i = 0; i < size; i++) {…}

Fillip answered 12/5, 2010 at 22:4 Comment(3)

It's not "definitely faster". In many cases that size() call could be hoisted out of the loop when counting up, so it would still only get called once. Obviously this is language and compiler dependent (and code dependent; eg. in C++ it won't get hoisted if size() is virtual), but it's far from definite either way. – Ackler 12/5, 2010 at 22:29

@Peter: Only if the compiler knows for certain that size() is idempotent across the loop. That's probably nearly always not the case, unless the loop is very simple. – Nautilus 13/5, 2010 at 1:1

@LawrenceDol, The compiler will definitely know it unless you have dynamic code compilatino using exec. – Anatomical 12/8, 2017 at 9:39

What matters much more than whether you're increasing or decreasing your counter is whether you're going up memory or down memory. Most caches are optimized for going up memory, not down memory. Since memory access time is the bottleneck that most programs today face, this means that changing your program so that you go up memory might result in a performance boost even if this requires comparing your counter to a non-zero value. In some of my programs, I saw a significant improvement in performance by changing my code to go up memory instead of down it.

Skeptical? Just write a program to time loops going up/down memory. Here's the output that I got:

Average Up Memory   = 4839 mus
Average Down Memory = 5552 mus

Average Up Memory   = 18638 mus
Average Down Memory = 19053 mus

(where "mus" stands for microseconds) from running this program:

#include <chrono>
#include <iostream>
#include <random>
#include <vector>
using namespace std;

//Sum all numbers going up memory.
template<class Iterator, class T>
inline void sum_abs_up(Iterator first, Iterator one_past_last, T &total) {
  T sum = 0;
  auto it = first;
  do {
    sum += *it;
    it++;
  } while (it != one_past_last);
  total += sum;
}

//Sum all numbers going down memory.
template<class Iterator, class T>
inline void sum_abs_down(Iterator first, Iterator one_past_last, T &total) {
  T sum = 0;
  auto it = one_past_last;
  do {
    it--;
    sum += *it;
  } while (it != first);
  total += sum;
}

//Time how long it takes to make num_repititions identical calls to sum_abs_down().
//We will divide this time by num_repitions to get the average time.
template<class T>
chrono::nanoseconds TimeDown(vector<T> &vec, const vector<T> &vec_original,
                             size_t num_repititions, T &running_sum) {
  chrono::nanoseconds total{0};
  for (size_t i = 0; i < num_repititions; i++) {
    auto start_time = chrono::high_resolution_clock::now();
    sum_abs_down(vec.begin(), vec.end(), running_sum);
    total += chrono::high_resolution_clock::now() - start_time;
    vec = vec_original;
  }
  return total;
}

template<class T>
chrono::nanoseconds TimeUp(vector<T> &vec, const vector<T> &vec_original,
                           size_t num_repititions, T &running_sum) {
  chrono::nanoseconds total{0};
  for (size_t i = 0; i < num_repititions; i++) {
    auto start_time = chrono::high_resolution_clock::now();
    sum_abs_up(vec.begin(), vec.end(), running_sum);
    total += chrono::high_resolution_clock::now() - start_time;
    vec = vec_original;
  }
  return total;
}

template<class Iterator, typename T>
void FillWithRandomNumbers(Iterator start, Iterator one_past_end, T a, T b) {
  random_device rnd_device;
  mt19937 generator(rnd_device());
  uniform_int_distribution<T> dist(a, b);
  for (auto it = start; it != one_past_end; it++)
    *it = dist(generator);
  return ;
}

template<class Iterator>
void FillWithRandomNumbers(Iterator start, Iterator one_past_end, double a, double b) {
  random_device rnd_device;
  mt19937_64 generator(rnd_device());
  uniform_real_distribution<double> dist(a, b);
  for (auto it = start; it != one_past_end; it++)
    *it = dist(generator);
  return ;
}

template<class ValueType>
void TimeFunctions(size_t num_repititions, size_t vec_size = (1u << 24)) {
  auto lower = numeric_limits<ValueType>::min();
  auto upper = numeric_limits<ValueType>::max();
  vector<ValueType> vec(vec_size);

  FillWithRandomNumbers(vec.begin(), vec.end(), lower, upper);
  const auto vec_original = vec;
  ValueType sum_up = 0, sum_down = 0;

  auto time_up   = TimeUp(vec, vec_original, num_repititions, sum_up).count();
  auto time_down = TimeDown(vec, vec_original, num_repititions, sum_down).count();
  cout << "Average Up Memory   = " << time_up/(num_repititions * 1000) << " mus\n";
  cout << "Average Down Memory = " << time_down/(num_repititions * 1000) << " mus"
       << endl;
  return ;
}

int main() {
  size_t num_repititions = 1 << 10;
  TimeFunctions<int>(num_repititions);
  cout << '\n';
  TimeFunctions<double>(num_repititions);
  return 0;
}

Both sum_abs_up and sum_abs_down do the same thing (sum the vector of numbers) and are timed the same way with the only difference being that sum_abs_up goes up memory while sum_abs_down goes down memory. I even pass vec by reference so that both functions access the same memory locations. Nevertheless, sum_abs_up is consistently faster than sum_abs_down. Give it a run yourself (I compiled it with g++ -O3).

It's important to note how tight the loop that I'm timing is. If a loop's body is large (has a lot of code) then it likely won't matter whether its iterator goes up or down memory since the time it takes to execute the loop's body will likely completely dominate. Also, it's important to mention that with some rare loops, going down memory is sometimes faster than going up it. But even with such loops it was never the case that going up memory was always slower than going down (unlike small-bodied loops that go up memory, for which the opposite is frequently true; in fact, for a small handful of loops I've timed, the increase in performance by going up memory was 40+%).

The point is, as a rule of thumb, if you have the option, if the loop's body is small, and if there's little difference between having your loop go up memory instead of down it, then you should go up memory.

FYI vec_original is there for experimentation, to make it easy to change sum_abs_up and sum_abs_down in a way that makes them alter vec while not allowing these changes to affect future timings. I highly recommend playing around with sum_abs_up and sum_abs_down and timing the results.

Stinky answered 27/4, 2017 at 22:29 Comment(2)

Can you please recommend any literature/blog/video that talks about the direction of memory access and its performance? – Reconstruction 24/11, 2021 at 8:53

@OliverTušla It's called cache prefetching en.wikipedia.org/wiki/Cache_prefetching and you should be able to find more info by searching for this term. I don't remember where I first learned of it but my original source would probably be outdated by now anyway. Cache control instructions might also interest you. More generally, you can also try searching for "cache performance" or "memory access optimization", etc. Implementation of cache prefetching is of course hardware dependent. – Stinky 1/2, 2022 at 2:57

On some older CPUs there are/were instructions like DJNZ == "decrement and jump if not zero". This allowed for efficient loops where you loaded an initial count value into a register and then you could effectively manage a decrementing loop with one instruction. We're talking 1980s ISAs here though - your teacher is seriously out of touch if he thinks this "rule of thumb" still applies with modern CPUs.

Electrostatic answered 12/5, 2010 at 22:29 Comment(0)

Is it faster to count down than up?

Maybe. But far more than 99% of the time it won't matter, so you should use the most 'sensible' test for terminating the loop, and by sensible, I mean that it takes the least amount of thought by a reader to figure out what the loop is doing (including what makes it stop). Make your code match the mental (or documented) model of what the code is doing.

If the loop is working it's way up through an array (or list, or whatever), an incrementing counter will often match up better with how the reader might be thinking of what the loop is doing - code your loop this way.

But if you're working through a container that has N items, and are removing the items as you go, it might make more cognitive sense to work the counter down.

A bit more detail on the 'maybe' in the answer:

It's true that on most architectures, testing for a calculation resulting in zero (or going from zero to negative) requires no explicit test instruction - the result can be checked directly. If you want to test whether a calculation results in some other number, the instruction stream will generally have to have an explicit instruction to test for that value. However, especially with modern CPUs, this test will usually add less than noise-level additional time to a looping construct. Particularly if that loop is performing I/O.

On the other hand, if you count down from zero, and use the counter as an array index, for example, you might find the code working against the memory architecture of the system - memory reads will often cause a cache to 'look ahead' several memory locations past the current one in anticipation of a sequential read. If you're working backwards through memory, the caching system might not anticipate reads of a memory location at a lower memory address. In this case, it's possible that looping 'backwards' might hurt performance. However, I'd still probably code the loop this way (as long as performance didn't become an issue) because correctness is paramount, and making the code match a model is a great way to help ensure correctness. Incorrect code is as unoptimized as you can get.

So I would tend to forget the professor's advice (of course, not on his test though - you should still be pragmatic as far as the classroom goes), unless and until the performance of the code really mattered.

Disgruntle answered 12/5, 2010 at 22:39 Comment(0)

Bob,

Not until you are doing microoptimizations, at which point you will have the manual for your CPU to hand. Further, if you were doing that sort of thing, you probably wouldn't be needing to ask this question anyway. :-) But, your teacher evidently doesn't subscribe to that idea....

There are 4 things to consider in your loop example:

for (i=N; 
 i>=0;             //thing 1
 i--)             //thing 2
{
  putchar('*');   //thing 3
}

Comparison

Comparison is (as others have indicated) relevant to particular processor architectures. There are more types of processors than those that run Windows. In particular, there might be an instruction that simplifies and speeds up comparisons with 0.

Adjustment

In some cases, it is faster to adjust up or down. Typically a good compiler will figure it out and redo the loop if it can. Not all compilers are good though.

Loop Body

You are accessing a syscall with putchar. That is massively slow. Plus, you are rendering onto the screen (indirectly). That is even slower. Think 1000:1 ratio or more. In this situation, the loop body totally and utterly outweighs the cost of the loop adjustment/comparison.

Caches

A cache and memory layout can have a large effect on performance. In this situation, it doesn't matter. However, if you were accessing an array and needed optimal performance, it would behoove you to investigate how your compiler and your processor laid out memory accessses and to tune your software to make the most of that. The stock example is the one given in relation to matrix multiplication.

Ballew answered 12/5, 2010 at 23:0 Comment(0)

It can be faster.

On the NIOS II processor I'm currently working with, the traditional for loop

for(i=0;i<100;i++)

produces the assembly:

ldw r2,-3340(fp) %load i to r2
addi r2,r2,1     %increase i by 1
stw r2,-3340(fp) %save value of i
ldw r2,-3340(fp) %load value again (???)
cmplti r2,r2,100 %compare if less than equal 100
bne r2,zero,0xa018 %jump

If we count down

for(i=100;i--;)

we get an assembly that needs 2 instructions less.

ldw r2,-3340(fp)
addi r3,r2,-1
stw r3,-3340(fp)
bne r2,zero,0xa01c

If we have nested loops, where the inner loop is executed a lot, we can have a measurable difference:

int i,j,a=0;
for(i=100;i--;){
    for(j=10000;j--;){
        a = j+1;
    }
}

If the inner loop is written like above, the execution time is: 0.12199999999999999734 seconds. If the inner loop is written the traditional way, the execution time is: 0.17199999999999998623 seconds. So the loop counting down is about 30% faster.

But: this test was made with all GCC optimizations turned off. If we turn them on, the compiler is actually smarter than this handish optimization and even keeps the value in a register during the whole loop and we would get an assembly like

addi r2,r2,-1
bne r2,zero,0xa01c

In this particular example the compiler even notices, that variable a will allways be 1 after the loop execution and skips the loops alltogether.

However I experienced that sometimes if the loop body is complex enough, the compiler is not able to do this optimization, so the safest way to always get a fast loop execution is to write:

register int i;
for(i=10000;i--;)
{ ... }

Of course this only works, if it does not matter that the loop is executed in reverse and like Betamoo said, only if you are counting down to zero.

Trusty answered 13/10, 2015 at 1:49 Comment(0)

It is an interesting question, but as a practical matter I don't think it's important and does not make one loop any better than the other.

According to this wikipedia page: Leap second, "...the solar day becomes 1.7 ms longer every century due mainly to tidal friction." But if you are counting days until your birthday, do you really care about this tiny difference in time?

It's more important that the source code is easy to read and understand. Those two loops are a good example of why readability is important -- they don't loop the same number of times.

I would bet that most programmers read (i = 0; i < N; i++) and understand immediately that this loops N times. A loop of (i = 1; i <= N; i++), for me anyway, is a little less clear, and with (i = N; i > 0; i--) I have to think about it for a moment. It's best if the intent of the code goes directly into the brain without any thinking required.

Tied answered 13/5, 2010 at 1:44 Comment(1)

The both constructs are exactly as easy to understand. There are some people that claim that if you have 3 or 4 repetitions, it's better to copy the instruction than to make a loop because it is for them easier to understand. – Recollected 10/5, 2013 at 8:45

Strangely, it appears that there IS a difference. At least, in PHP. Consider following benchmark:

<?php

print "<br>".PHP_VERSION;
$iter = 100000000;
$i=$t1=$t2=0;

$t1 = microtime(true);
for($i=0;$i<$iter;$i++){}
$t2 = microtime(true);
print '<br>$i++ : '.($t2-$t1);

$t1 = microtime(true);
for($i=$iter;$i>0;$i--){}
$t2 = microtime(true);
print '<br>$i-- : '.($t2-$t1);

$t1 = microtime(true);
for($i=0;$i<$iter;++$i){}
$t2 = microtime(true);
print '<br>++$i : '.($t2-$t1);

$t1 = microtime(true);
for($i=$iter;$i>0;--$i){}
$t2 = microtime(true);
print '<br>--$i : '.($t2-$t1);

Results are interesting:

PHP 5.2.13
$i++ : 8.8842368125916
$i-- : 8.1797409057617
++$i : 8.0271911621094
--$i : 7.1027431488037


PHP 5.3.1
$i++ : 8.9625310897827
$i-- : 8.5790238380432
++$i : 5.9647901058197
--$i : 5.4021768569946

If someone knows why, it would be nice to know :)

EDIT: Results are the same even if you start counting not from 0, but other arbitrary value. So there is probably not only comparison to zero which makes a difference?

Allergen answered 13/5, 2010 at 2:6 Comment(6)

The reason it's slower is that the prefix operator doesn't need to store a temporary. Consider $foo = $i++; Three things happen: $i is stored to a temporary, $i is incremented, and then $foo is assigned that temporary's value. In the case of $i++; a smart compiler could realize the temporary is unnecessary. PHP just doesn't. C++ and Java compilers are smart enough to make this simple optimization. – Dieldrin 13/5, 2010 at 8:32

and why $i-- is faster than $i++ ? – Allergen 13/5, 2010 at 10:3

How many iterations of your benchmark did you run? Did you clip outriders and take an average for each result? Was your computer doing anything else during the benchmarks? That ~0.5 difference could just be the result of other CPU activity, or pipeline utilisation, or... or... well, you get the idea. – Merciful 13/5, 2010 at 12:2

Yes, here i am giving averages. Benchmark was runned on different machines, and difference is accidentally. – Allergen 13/5, 2010 at 13:44

@Conspicuous Compiler => you know or you suppose? – Allergen 13/5, 2010 at 14:17

PHP has so much interpretation overhead around it that little things such as incrementing/decrementing shouldn't make any difference whatsoever. I doubt there's any good reason why $i-- should have been any different to $i++. – Lentamente 31/12, 2011 at 0:21

What your teacher have said was some oblique statement without much clarification. It is NOT that decrementing is faster than incrementing but you can create much much faster loop with decrement than with increment.

Without going on at length about it, without need of using loop counter etc - what matters below is just speed and loop count (non zero).

Here is how most people implement loop with 10 iterations:

int i;
for (i = 0; i < 10; i++)
{
    //something here
}

For 99% of cases it is all one may need but along with PHP, PYTHON, JavaScript there is the whole world of time critical software (usually embedded, OS, games etc) where CPU ticks really matter so look briefly at assembly code of:

int i;
for (i = 0; i < 10; i++)
{
    //something here
}

after compilation (without optimisation) compiled version may look like this (VS2015):

-------- C7 45 B0 00 00 00 00  mov         dword ptr [i],0  
-------- EB 09                 jmp         labelB 
labelA   8B 45 B0              mov         eax,dword ptr [i]  
-------- 83 C0 01              add         eax,1  
-------- 89 45 B0              mov         dword ptr [i],eax  
labelB   83 7D B0 0A           cmp         dword ptr [i],0Ah  
-------- 7D 02                 jge         out1 
-------- EB EF                 jmp         labelA  
out1:

The whole loop is 8 instructions (26 bytes). In it - there are actually 6 instructions (17 bytes) with 2 branches. Yes yes I know it can be done better (its just an example).

Now consider this frequent construct which you will often find written by embedded developer:

i = 10;
do
{
    //something here
} while (--i);

It also iterates 10 times (yes I know i value is different compared with shown for loop but we care about iteration count here). This may be compiled into this:

00074EBC C7 45 B0 01 00 00 00 mov         dword ptr [i],1  
00074EC3 8B 45 B0             mov         eax,dword ptr [i]  
00074EC6 83 E8 01             sub         eax,1  
00074EC9 89 45 B0             mov         dword ptr [i],eax  
00074ECC 75 F5                jne         main+0C3h (074EC3h)

5 instructions (18 bytes) and just one branch. Actually there are 4 instruction in the loop (11 bytes).

The best thing is that some CPUs (x86/x64 compatible included) have instruction that may decrement a register, later compare result with zero and perform branch if result is different than zero. Virtually ALL PC cpus implement this instruction. Using it the loop is actually just one (yes one) 2 byte instruction:

00144ECE B9 0A 00 00 00       mov         ecx,0Ah  
label:
                          // something here
00144ED3 E2 FE                loop        label (0144ED3h)  // decrement ecx and jump to label if not zero

Do I have to explain which is faster?

Now even if particular CPU does not implement above instruction all it requires to emulate it is a decrement followed by conditional jump if result of previous instruction happens to be zero.

So regardless of some cases that you may point out as an comment why I am wrong etc etc I EMPHASISE - YES IT IS BENEFICIAL TO LOOP DOWNWARDS if you know how, why and when.

PS. Yes I know that wise compiler (with appropriate optimisation level) will rewrite for loop (with ascending loop counter) into do..while equivalent for constant loop iterations ... (or unroll it) ...

Covered answered 9/5, 2017 at 18:22 Comment(0)

No, that's not really true. One situation where it could be faster is when you would otherwise be calling a function to check the bounds during every iteration of a loop.

for(int i=myCollection.size(); i >= 0; i--)
{
   ...
}

But if it's less clear to do it that way, it's not worthwhile. In modern languages, you should use a foreach loop when possible, anyway. You specifically mention the case where you should use a foreach loop -- when you don't need the index.

Sair answered 12/5, 2010 at 22:3 Comment(1)

To be clear and efficient you should be in the habit of at least for(int i=0, siz=myCollection.size(); i<siz; i++). – Nautilus 13/5, 2010 at 0:59

regardless of the direction always use the prefix form (++i instead of i++)!

for (i=N; i>=0; --i)

for (i=0; i<N; ++i)

Explanation: http://www.eskimo.com/~scs/cclass/notes/sx7b.html

Furthermore you can write

for (i=N; i; --i)

But i would expect modern compilers to be able to do exactly these optimizations.

Windswept answered 12/5, 2010 at 22:4 Comment(7)

Never seen people complain about that before. But after reading the link it actually makes sense :) Thank you. – Trust 12/5, 2010 at 22:8

Um, why should he always use the prefix form? If there's no assignment going on, they are identical, and the article you linked to even says that postfix form is more common. – Sulfite 12/5, 2010 at 22:9

Why should one always use the prefix form? In this instance, it's semantically identical. – Thrum 12/5, 2010 at 22:9

The postfix form can potentially create an unnecessary copy of the object, although if the value is never being used, the compiler will probably optimize it to the prefix form anyway. – Enthetic 12/5, 2010 at 22:12

Out of force of habit, I always do --i and i++ because when I learned C computers usually had a register predecrement and postincrement, but not vice versa. Thus, *p++ and *--p were faster than *++p and *p-- because the former two could be done in one 68000 machine code instruction. – Morse 14/5, 2010 at 13:7

I tend to read ++i as "increment i", which makes more sense than "i increment" as the postfix form implies. That's probably just me, though. – Lentamente 31/12, 2011 at 0:25

Well, if you're using C++, it is always preferrable to use ++i and --i, because in case i happens to be a non-integral type (an iterator), the code for pre-increment and pre-decrement is simpler, as it doesn't require making a copy of the object as in the case of post-increment or post-decrement. It should almost always optimize away, but still ... – Imprecation 30/10, 2012 at 13:1

The point is that when counting down you don't need to check i >= 0 separately to decrementing i. Observe:

for (i = 5; i--;) {
  alert(i);  // alert boxes showing 4, 3, 2, 1, 0
}

Both the comparison and decrementing i can be done in the one expression.

See other answers for why this boils down to fewer x86 instructions.

As to whether it makes a meaningful difference in your application, well I guess that depends on how many loops you have and how deeply nested they are. But to me, it's just as readable to do it this way, so I do it anyway.

Belonging answered 13/5, 2010 at 1:42 Comment(1)

I think this is poor style, because it depends on the reader knowing that the return value of i-- is the old value of i, for the possible value of saving a cycle. That'd only be significant if there were lots of loop iterations, and the cycle was a significant fraction of the length of the iteration, and actually showed up at run time. Next, someone will try for (i=5; --i;) because they've heard that in C++ you might want to avoid creating some temporary when i is a non-trivial type, and now you're in bug land having callously thrown away your opportunity to make wrong code look wrong. – Thematic 14/4, 2014 at 15:33

Now, I think you had enough assembly lectures:) I would like to present you another reason for top->down approach.

The reason to go from the top is very simple. In the body of the loop, you might accidentally change the boundary, which might end in incorrect behaviour or even non-terminating loop.

Look at this small portion of Java code (the language does not matter I guess for this reason):

    System.out.println("top->down");
    int n = 999;
    for (int i = n; i >= 0; i--) {
        n++;
        System.out.println("i = " + i + "\t n = " + n);
    }
    System.out.println("bottom->up");
    n = 1;
    for (int i = 0; i < n; i++) {
        n++;
        System.out.println("i = " + i + "\t n = " + n);
    }

So my point is you should consider prefering going from the top down or having a constant as a boundary.

Omegaomelet answered 13/5, 2010 at 0:20 Comment(5)

Huh?!! You failing example is really counter-intuitive, which is to say, a straw-man argument - no one would ever write this. One would write for (int i=0; i < 999; i++) {. – Nautilus 13/5, 2010 at 1:12

@Software Monkey imagine n being a result of some computation... e.g. you might want to iterate over some collection and its size is the boundary, but as some side effect, you add new elements to the collection in the loop body. – Timoteo 13/5, 2010 at 8:8

If that's what you intended to communicate, then that's what your example should illustrate: for(int xa=0; xa<collection.size(); xa++) { collection.add(SomeObject); ... } – Nautilus 13/5, 2010 at 18:20

@Software Monkey I wanted to be more general than just talk particularly about collections, because what I am reasoning about has nothing to do with collections – Timoteo 13/5, 2010 at 18:42

Yes, but if you are going to reason by example, your examples need to be credible and illustrative of the point. – Nautilus 14/5, 2010 at 3:52

-1

At an assembler level a loop that counts down to zero is generally slightly faster than one that counts up to a given value. If the result of a calculation is equal to zero most processors will set a zero flag. If subtracting one makes a calculation wrap around past zero this will normally change the carry flag (on some processors it will set it on others it will clear it), so the comparison with zero comes essentially for free.

This is even more true when the number of iterations is not a constant but a variable.

In trivial cases the compiler may be able to optimise the count direction of a loop automatically but in more complex cases it may be that the programmer knows that the direction of the loop is irrelevant to the overall behaviour but the compiler cannot prove that.

Poolroom answered 29/12, 2016 at 17:38 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags