How to tell the compiler to unroll this loop [duplicate]

Asked 15/4, 2013 at 18:35 Answered 15/4, 2013 at 19:29

I have the following loop that I am running on an ARM processor.

// pin here is a pointer to some part of an array
for (i = 0; i < v->numelements; i++)
{
    pe   = pptr[i];
    peParent = pe->parent;

    SPHERE  *ps = (SPHERE *)(pe->data);

    pin[0] = FLOAT2FIX(ps->rad2);
    pin[1] = *peParent->procs->pe_intersect == &SphPeIntersect;
    fixifyVector( &pin[2], ps->center ); // Is an inline function

    pin = pin + 5;
}

Judging by the slow performance of the loop, I suspect the compiler was unable to unroll it: when I do the unrolling manually, it becomes quite fast. I think the compiler is getting confused by the pin pointer. Can we use the restrict keyword to help the compiler here, or is restrict reserved for function parameters? In general, how can we tell the compiler to unroll the loop and not worry about the pin pointer?

Ploy answered 15/4, 2013 at 18:35 Comment(10)

Did you measure execution times on a debug or release build? – Wellmeaning 15/4, 2013 at 18:36

Release build with -O3 optimization. – Ploy 15/4, 2013 at 18:37

Have you tried assigning v->numelements to a local and using that in the for loop? It could be that the compiler cannot unroll the loop because it has to assume the value of v->numelements will be changed in fixifyVector. – Wellmeaning 15/4, 2013 at 18:44

fixifyVector is inlined, so I don't think that is the problem. – Ploy 15/4, 2013 at 18:47

gcc also has the -funroll-loops optimization flag. Looking at the docs, it has to be enabled separately from -O3. – Kuhn 15/4, 2013 at 18:52

I don't see anything in that code that would be helped significantly by unrolling the loop. You're going to have to look at the generated object code for the two cases to figure out what's going on. – Menon 15/4, 2013 at 19:2

@RobertPrior You should post it as an answer since I bet that's it. I don't think any compiler will do (heavy) loop unrolling unasked for, since it is quite ineffective in terms of program memory space. – Primeval 15/4, 2013 at 19:3

@JohnR.Strohm Doesn't that depend on numelements? If it is in millions, you can avoid many code jumps and thus comparisons by unrolling. Or are there other benefits to loop unrolling that cannot be gained in this segment? – Dispend 15/4, 2013 at 19:13

You might want to add the specific ARM CPU, it is probably important to a performance related question. – Meissen 15/4, 2013 at 19:35

@jsn, what I see is an array lookup, several pointer manipulations, and a function invocation. My gut feel is that these will completely dominate the per-iteration time, compared to the loop overhead. The guy could make a significant improvement by caching v->numelements in the loop initialization, instead of fetching it every time through, but that shouldn't be that expensive an operation. – Menon 8/6, 2013 at 15:15
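The two suggestions raised in the question and comments (reading v->numelements once, and using restrict on a local pointer) can be sketched as follows. C99 does allow restrict on block-scope pointers, not only on function parameters. The types and names here (vec_t, fill_records, values) are hypothetical stand-ins, not the asker's code, and whether this changes the generated code depends on the compiler and target.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the asker's container type. */
typedef struct {
    size_t numelements;
    const int *values;
} vec_t;

/* Writes two ints per element into out. The caller must guarantee that
 * out does not overlap v or v->values; otherwise the restrict
 * qualifier below makes the behavior undefined. */
void fill_records(const vec_t *v, int *out)
{
    assert(v != NULL && out != NULL);

    const size_t n = v->numelements; /* loop bound read once, up front */
    int *restrict pin = out;         /* restrict on a local is valid C99 */

    for (size_t i = 0; i < n; i++) {
        pin[0] = v->values[i];
        pin[1] = v->values[i] * 2;
        pin += 2;
    }
}
```

With a fixed local bound and a restrict-qualified output pointer, the compiler no longer has to assume that stores through pin might modify the loop bound or the input data.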

To tell gcc to unroll all loops you can use the optimization flag -funroll-loops.

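To unroll loops only in a specific function, you can put this attribute on the function (it applies to the whole function, not to an individual loop):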

__attribute__((optimize("unroll-loops")))

see this answer for more details.

Edit

If the compiler cannot determine the number of iterations of the loop upon entry, you will need to use -funroll-all-loops. Note, however, what the documentation says about it: "Unroll all loops, even if their number of iterations is uncertain when the loop is entered. This usually makes programs run more slowly."
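As a rough sketch of how these options look in source code: the optimize attribute is attached to a function, and newer GCC releases (8 and later) additionally accept a per-loop #pragma GCC unroll. The function names below (scale_all, sum_unrolled) are made up for illustration, and the unroll factor of 4 is an arbitrary choice. The compiler is still free to decide how the final code looks.

```c
#include <assert.h>
#include <stddef.h>

/* Function-level request: asks GCC to apply -funroll-loops to the
 * loops in this one function only. Hypothetical example function. */
__attribute__((optimize("unroll-loops")))
void scale_all(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * 3;
}

/* Loop-level request (GCC 8 or later): the pragma applies to the
 * loop that immediately follows it. */
long sum_unrolled(const int *src, size_t n)
{
    assert(src != NULL || n == 0);

    long total = 0;
#pragma GCC unroll 4
    for (size_t i = 0; i < n; i++)
        total += src[i];
    return total;
}
```

Comparing the generated assembly (for example with gcc -O3 -S) with and without these annotations is the reliable way to see whether unrolling actually happened.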

Kuhn answered 15/4, 2013 at 19:12 Comment(0)

If you extend the size of pptr by one, you can use the pld instruction.

  __asm__ __volatile__("pld\t[%0]" :: "r" (pptr[i+1]));

Alternatively, you may need to preload the next peParent and SPHERE *ps. The loop overhead on an ARM is very small, so it is unlikely that unrolling the loop will be a significant benefit. There are no loop variable constants. It is more likely that, once you have unrolled the loop, the compiler's scheduler is able to fetch data in advance of its use.

You have not presented all of the code, so the data dependencies cannot be seen. There may be other variables that would benefit from being preloaded. Giving a complete example would probably help everyone answer your question.
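A possible variation on the pld idea, sketched with GCC's __builtin_prefetch instead of inline assembly. On ARM targets that have the instruction, GCC generally emits pld for this builtin. The item_t type and sum_with_prefetch function are hypothetical and only illustrate the pattern. The bounds check is one way to avoid extending the pointer array by one element.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical element type standing in for the asker's structures. */
typedef struct {
    int value;
} item_t;

/* Sums items[i]->value while hinting that the next element will be
 * needed soon. The prefetch is only a hint and never changes results. */
long sum_with_prefetch(item_t *const *items, size_t n)
{
    assert(items != NULL || n == 0);

    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n)
            __builtin_prefetch(items[i + 1]); /* next iteration's data */
        total += items[i]->value;
    }
    return total;
}
```

Whether this helps depends on the specific core and on how much work each iteration does; it is worth measuring before and after.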

Meissen answered 15/4, 2013 at 19:29 Comment(1)

This answer is more likely to apply on Cortex-type CPUs, where the pipeline is longer. The specific ARM CPU wasn't mentioned. – Meissen 15/4, 2013 at 19:34
