What matters much more than whether you're increasing or decreasing your counter is whether you're going up memory or down memory. Most caches are optimized for going up memory, not down memory. Since memory access time is the bottleneck that most programs today face, this means that changing your program so that you go up memory might result in a performance boost even if this requires comparing your counter to a non-zero value. In some of my programs, I saw a significant improvement in performance by changing my code to go up memory instead of down it.
Skeptical? Just write a program to time loops going up/down memory. Here's the output that I got:
Average Up Memory = 4839 mus
Average Down Memory = 5552 mus
Average Up Memory = 18638 mus
Average Down Memory = 19053 mus
(where "mus" stands for microseconds) from running this program:
#include <chrono>
#include <iostream>
#include <random>
#include <vector>
using namespace std;
//Sum all numbers going up memory.
template<class Iterator, class T>
inline void sum_abs_up(Iterator first, Iterator one_past_last, T &total) {
T sum = 0;
auto it = first;
do {
sum += *it;
it++;
} while (it != one_past_last);
total += sum;
}
//Sum all numbers going down memory.
template<class Iterator, class T>
inline void sum_abs_down(Iterator first, Iterator one_past_last, T &total) {
T sum = 0;
auto it = one_past_last;
do {
it--;
sum += *it;
} while (it != first);
total += sum;
}
//Time how long it takes to make num_repititions identical calls to sum_abs_down().
//We will divide this time by num_repitions to get the average time.
template<class T>
chrono::nanoseconds TimeDown(vector<T> &vec, const vector<T> &vec_original,
size_t num_repititions, T &running_sum) {
chrono::nanoseconds total{0};
for (size_t i = 0; i < num_repititions; i++) {
auto start_time = chrono::high_resolution_clock::now();
sum_abs_down(vec.begin(), vec.end(), running_sum);
total += chrono::high_resolution_clock::now() - start_time;
vec = vec_original;
}
return total;
}
template<class T>
chrono::nanoseconds TimeUp(vector<T> &vec, const vector<T> &vec_original,
size_t num_repititions, T &running_sum) {
chrono::nanoseconds total{0};
for (size_t i = 0; i < num_repititions; i++) {
auto start_time = chrono::high_resolution_clock::now();
sum_abs_up(vec.begin(), vec.end(), running_sum);
total += chrono::high_resolution_clock::now() - start_time;
vec = vec_original;
}
return total;
}
template<class Iterator, typename T>
void FillWithRandomNumbers(Iterator start, Iterator one_past_end, T a, T b) {
random_device rnd_device;
mt19937 generator(rnd_device());
uniform_int_distribution<T> dist(a, b);
for (auto it = start; it != one_past_end; it++)
*it = dist(generator);
return ;
}
template<class Iterator>
void FillWithRandomNumbers(Iterator start, Iterator one_past_end, double a, double b) {
random_device rnd_device;
mt19937_64 generator(rnd_device());
uniform_real_distribution<double> dist(a, b);
for (auto it = start; it != one_past_end; it++)
*it = dist(generator);
return ;
}
template<class ValueType>
void TimeFunctions(size_t num_repititions, size_t vec_size = (1u << 24)) {
auto lower = numeric_limits<ValueType>::min();
auto upper = numeric_limits<ValueType>::max();
vector<ValueType> vec(vec_size);
FillWithRandomNumbers(vec.begin(), vec.end(), lower, upper);
const auto vec_original = vec;
ValueType sum_up = 0, sum_down = 0;
auto time_up = TimeUp(vec, vec_original, num_repititions, sum_up).count();
auto time_down = TimeDown(vec, vec_original, num_repititions, sum_down).count();
cout << "Average Up Memory = " << time_up/(num_repititions * 1000) << " mus\n";
cout << "Average Down Memory = " << time_down/(num_repititions * 1000) << " mus"
<< endl;
return ;
}
int main() {
size_t num_repititions = 1 << 10;
TimeFunctions<int>(num_repititions);
cout << '\n';
TimeFunctions<double>(num_repititions);
return 0;
}
Both sum_abs_up
and sum_abs_down
do the same thing (sum the vector of numbers) and are timed the same way with the only difference being that sum_abs_up
goes up memory while sum_abs_down
goes down memory. I even pass vec
by reference so that both functions access the same memory locations. Nevertheless, sum_abs_up
is consistently faster than sum_abs_down
. Give it a run yourself (I compiled it with g++ -O3).
It's important to note how tight the loop that I'm timing is. If a loop's body is large (has a lot of code) then it likely won't matter whether its iterator goes up or down memory since the time it takes to execute the loop's body will likely completely dominate. Also, it's important to mention that with some rare loops, going down memory is sometimes faster than going up it. But even with such loops it was never the case that going up memory was always slower than going down (unlike small-bodied loops that go up memory, for which the opposite is frequently true; in fact, for a small handful of loops I've timed, the increase in performance by going up memory was 40+%).
The point is, as a rule of thumb, if you have the option, if the loop's body is small, and if there's little difference between having your loop go up memory instead of down it, then you should go up memory.
FYI vec_original
is there for experimentation, to make it easy to change sum_abs_up
and sum_abs_down
in a way that makes them alter vec
while not allowing these changes to affect future timings. I highly recommend playing around with sum_abs_up
and sum_abs_down
and timing the results.
putchar
is using 99.9999% of the time (give or take). – Nolde++
operators in PHP at one point though...$i++
took longer than++$i
for some reason. – Flipperi
is unsigned, the first loop is an infinite one? – Ramp