C++ Loop Unrolling Performance Difference (Project Euler)
I have a question about a Project Euler question and optimization using loop unrolling.

Problem description: 2520 is the smallest number that can be divided by each of the numbers from 1 to 10 without any remainder. What is the smallest positive number that is evenly divisible by all of the numbers from 1 to 20?

Solution:

#include <iostream>
#include <limits.h>
#include <stdio.h>
#include <time.h>

using namespace std;

int main() {

    clock_t startTime = clock();

    for (int i = 1; i < INT_MAX; i++)
    {
        bool isDivisible = true;

        //CODE BLOCK #1
        /*for (int j = 2; j <= 20; j++)
        {
                if ( i % j != 0)
                {
                        isDivisible = false;
                        break;
                }
        }*/

        //CODE BLOCK #2
        /*if (i % 2 != 0 || i % 3 != 0 ||
                i % 4 != 0 || i % 5 != 0 ||
                i % 6 != 0 || i % 7 != 0 ||
                i % 8 != 0 || i % 9 != 0 ||
                i % 10 != 0 || i % 11 != 0 ||
                i % 12 != 0 || i % 13 != 0 ||
                i % 14 != 0 || i % 15 != 0 ||
                i % 16 != 0 || i % 17 != 0 ||
                i % 18 != 0 || i % 19 != 0 ||
                i % 20 != 0 )
                isDivisible = false;*/

        if (isDivisible)
        {
                cout << "smallest: " << i << endl;

                cout << "Ran in: " << clock() - startTime << " cycles" << endl;
                break;
        }
    }

return 0;
}

Now, uncommenting either CODE BLOCK #1 or CODE BLOCK #2 gives me the correct answer (232792560). However, CODE BLOCK #2 is much faster than CODE BLOCK #1.

CODE BLOCK #1: 3,580,000 cycles (after adding the break to CODE BLOCK #1 it runs much faster, but it is still significantly slower than the compound IF statement).

CODE BLOCK #2: 970,000 cycles

Does anyone know why this huge performance difference might occur?

Nullify answered 8/11, 2013 at 19:18 Comment(0)

Using || means that as soon as one condition is true, the remaining conditions are not evaluated (short-circuit evaluation). This is equivalent to the loop:

    for (int j = 2; j <= 20; j++)
    {
        if ( i % j != 0){
            isDivisible = false;
            break;
        }
    }

If you try this you may find the gap in running times has been narrowed. Any other differences could be attributed to loop overhead, but with optimizations turned on in your compiler I suspect that they would run at the same speed (or at least have much more similar times).

EDIT About the new performance difference:
There are many optimized ways to check divisibility by a constant. For example, for N any power of 2, i % N != 0 can be replaced with (i & (N - 1)) != 0; other tricks exist that are less obvious.
The compiler knows many of these tricks. In the second code block it can probably optimize most, if not all, of the divisibility checks, because the constant divisors are written out directly. In the first code block it must first decide to unroll the loop and then substitute constants for the loop variable before it can even begin to reason about the individual checks.
This difference likely lets the compiler generate better-optimized code for block 2 than for block 1.

Szabadka answered 8/11, 2013 at 19:27 Comment(1)
Ahh, that makes sense. I just added the break into CODE BLOCK #1 and it runs much faster. Still significantly slower than the compound IF statement, however. With break statement: 3,580,000 cycles. – Nullify

3,580,000 vs 970,000 isn't loop overhead alone...

In your last kernel, it looks like you intended the Up, Down, and square blocks to persist across the outer loop, but those blocks are local, so the data they contain is not shared between branches. Unfortunately, your approach wouldn't work even if they were shared between branches.

In your inner loop, each iteration uses data that was calculated in the previous iteration. It is not entirely trivial to parallelize such loops, and sometimes it cannot be done at all. In your case, a simple solution would be to use atomic operations to increment the Up and Down counters, but it wouldn't be efficient, because atomic operations cause implicit serialization.

You should probably look into solving this with existing parallel primitives that have already been optimized, such as the prefix-sum implementations in CUB or Thrust.

Fritzie answered 8/11, 2013 at 20:26 Comment(8)
Does this answer belong with a different question? I don't see any arrays or threads in the posted code. – Regulation
Not sure what you're referring to. I'm responding to the Project Euler post. – Fritzie
Even after your edit I'm not sure how this answer applies. What is "the block of Up, Down, and square"? In addition, instead of trying to parallelize this code there is a trivial (and fast) solution using Least Common Multiples. – Regulation
Please elaborate on the least common multiples method. I think loop overhead alone cannot account for the 4x speedup in the original poster's question. – Fritzie
You can see my solution here on ideone.com. It's simple and fast. – Regulation
OP isn't asking for an algorithm to solve the problem. He has found two. He's asking why the performance is so different. – Fritzie
This is correct, I don't have any issues coming up with the correct answer; I just want to know why the number of cycles taken to run the code differs by so much. Please re-read the question, Blastfurnace. – Nullify
@user988398: I don't have a definitive answer for you (which is why I didn't post an answer). My guess is one or more of the usual suspects: quality of compiler code generation, effective use of CPU pipelines, or branch prediction. – Regulation
