Loop unroll (with bitwise operations)

Asked 13/9, 2012 at 7:23 Answered 13/9, 2012 at 7:43

Solved c linux-kernel bit-manipulation loop-unrolling

I am writing a Linux Kernel driver (for ARM) and in an irq handler I need to check the interrupt bits.

bit
 0/16  End point 0 In/Out interrupt
       (very likely, while In is more likely)
 1/17  End point 1 In/Out interrupt
 ...
15/31  End point 15 In/Out interrupt

Note that more than a bit can be set at a time.

So this is the code:

int i;
u32 intr = read_interrupt_register();

/* ep0 IN */
if(likely(intr & (1 << 0))){
    handle_ep0_in();
}

/* ep0 OUT */
if(likely(intr & (1 << 16))){
    handle_ep0_out();
}

for(i=1;i<16;++i){
    if(unlikely(intr & (1 << i))){
        handle_ep_in(i);
    }
    if(unlikely(intr & (1 << (i + 16)))){
        handle_ep_out(i);
    }
}

(1 << 0) and (1 << 16) would be calculated in compile time, however (1 << i) and (1 << (i + 16)) wouldn't. Also there would be integral comparison and addition in the loop.

Because it is an irq handler, work should be done within the shortest time. This let me think whether I need to optimize it a bit.

Possible ways?

1. Split the loop, seems to make no difference...

/* ep0 IN */
if(likely(intr & (1 << 0))){
    handle_ep0_in();
}

/* ep0 OUT */
if(likely(intr & (1 << 16))){
    handle_ep0_out();
}

for(i=1;i<16;++i){
    if(unlikely(intr & (1 << i))){
        handle_ep_in(i);
    }
}
for(i=17;i<32;++i){
    if(unlikely(intr & (1 << i))){
        handle_ep_out(i - 16);
    }
}

2. Shift intr instead of the value to be compared to?

/* ep0 IN */
if(likely(intr & (1 << 0))){
    handle_ep0_in();
}

/* ep0 OUT */
if(likely(intr & (1 << 16))){
    handle_ep0_out();
}

for(i=1;i<16;++i){
    intr >>= 1;
    if(unlikely(intr & 1)){
        handle_ep_in(i);
    }
}
intr >>= 1;
for(i=1;i<16;++i){
    intr >>= 1;
    if(unlikely(intr & 1)){
        handle_ep_out(i);
    }
}

3. Fully unroll the loop (not shown). That would make the code a bit messy.

4. Any other better ways?

5. Or it's that the compiler will actually generate the most optimized way?

Edit: I was looking for a way to tell the gcc compiler to unroll that particular loop, but it seems that it isn't possible according to my search...

Burmaburman answered 13/9, 2012 at 7:23 Comment(1)

You have only 17 elements to handle. Manually unrolled it is not messier than the code in your first example – Buff 13/9, 2012 at 7:28

If we can assume that the number of set bits in intr is low (as it is usually the case in interrupt masks) we can optimize a little bit and write a loop that executes for each bit only once:

void handle (int intr)
{
  while (intr)
  {
    // find index of lowest bit set in intr:
    int bit_id = __builtin_ffs(intr)-1;

    // call handler:
    if (bit_id > 16)
      handle_ep_out (bit_id-16);
    else
      handle_ep_in (bit_id);

    // clear that bit
    // (I think there was a bit-hack out there to simplify this step even further)
    intr -= (1<<bit_id);
  }
}

On most ARM architectures __builtin_ffs will compile down to a CLZ instruction and some arithmetic around it. It should do so for anything but ARM7 and older cores.

Also: When writing interrupt handlers on embedded devices the size of the function makes a difference for performance as well because the instructions have to be loaded into the code-cache. Lean code usually executes faster. A bit overhead is okay if you save memory accesses to memory that is unlikely to be in the cache.

Inhalant answered 13/9, 2012 at 7:43 Comment(9)

You missed the special cases for the argument-less functions handle_ep0_in and handle_ep0_out, but +1 for ffs – Buff 13/9, 2012 at 7:48

Also I don't know if __builtin_ffs is allowed in the kernel, but they very likely have some replacement for it if it is not allowed. – Inhalant 13/9, 2012 at 7:49

Why would it not be allowed. If it by some chance is not, you can use clz via inline asm directly. – Extreme 13/9, 2012 at 7:54

@nils-pipenbrinck: Yes, ffs, with arch-specific versions if available in __ffs. – Buff 13/9, 2012 at 7:55

@dbrank0: often the reason for such restrictions is the compiler support library, which is not used in the kernel. Search for "udivdi3" here. – Buff 13/9, 2012 at 7:57

I just checked the code compiled with -O3 for cortex-a8.. Looks neat. 10 instructions in the loop-body including the loop control and the branches to the handle_op_in/out functions. – Inhalant 13/9, 2012 at 8:0

I think my interrupt handler cannot be small indeed. There is only a single irq for that device, and there are more than 50 interrupts to handle inside it. The code shown above is only some of them. – Burmaburman 13/9, 2012 at 8:10

Just found ffs in LXR. But what is the difference of ffs(x) and __ffs(x)? #define __ffs(x) (ffs(x) - 1) – Burmaburman 13/9, 2012 at 8:12

It comes out to be the same but for the logic of it I would go for intr ^= (1<<bit_id). – Ammonal 13/9, 2012 at 9:8

I would probably go for option 5 myself. Code for readability and let gcc's insane optimisation level -O3 do what it can.

I've seen code generated at that level that I can't even understand.

Any hand-crafted optimisation in C (other than possibly unrolling and using constants rather than runtime bit shifts, a la option 3) is unlikely to outperform what the compiler itself can do.

I think you'll find that the unrolling may not be as messy as you think:

if (  likely(intr & 0x00000001)) handle_ep0_in();
if (  likely(intr & 0x00010000)) handle_ep0_out();

if (unlikely(intr & 0x00000002)) handle_ep_in(1);
if (unlikely(intr & 0x00020000)) handle_ep_out(1);

:

if (unlikely(intr & 0x00008000)) handle_ep_in(15);
if (unlikely(intr & 0x80000000)) handle_ep_out(15);

In fact, you can make it a lot less messier with macros (untested, but you should get the general idea):

// Since mask is a constant, "mask << 32" should be too.

# define chkintr (mask, num) \
    if (unlikely(intr & (mask      ))) handle_ep_in  (num); \
    if (unlikely(intr & (mask << 32))) handle_ep_out (num);

// Special case for high probability bit.

if (likely(intr & 0x00000001UL)) handle_ep0_in();
if (likely(intr & 0x00010000UL)) handle_ep0_out();

chkintr (0x0002UL,  1);  chkintr (0x0004UL,  2);  chkintr (0x0008UL,  3);
chkintr (0x0010UL,  4);  chkintr (0x0020UL,  5);  chkintr (0x0040UL,  6);
chkintr (0x0080UL,  7);  chkintr (0x0100UL,  8);  chkintr (0x0200UL,  9);
chkintr (0x0400UL, 10);  chkintr (0x0800UL, 11);  chkintr (0x1000UL, 12);
chkintr (0x2000UL, 13);  chkintr (0x4000UL, 14);  chkintr (0x8000UL, 15);

The only step up from there is hand-coding assembly language and there's still the good possibility that gcc may be able to outperform you :-)

Ramiform answered 13/9, 2012 at 7:30 Comment(3)

Well I may me over-worried because it wouldn't really use too many time, but I still don't want to lie to myself :P. Also I think that Linux Kernel by default builds with optimization level 2. – Burmaburman 13/9, 2012 at 7:36

Level 2 may well be enough. Certainly, there are things in the kernel already with pretty stringent timing requirements, so maybe -O2 is enough. -O3 would probably make kernel debugging quite a nightmare. Bottom line advice, don't worry about this until you know it's a problem. Both the loop and the unrolled form will probably be more than fast enough. – Ramiform 13/9, 2012 at 7:39

well, lets see if there would be more answers. – Burmaburman 13/9, 2012 at 7:44

Possible ways?

Recommended topics

Hot tags