Speed of nested if vs for loop
Asked Answered
E

3

5

I recently came across this piece of code in an interrupt service routine (ISR):

#define MAX_CHANNELS 4
static uint16_t volatile* ADCVALS[MAX_CHANNELS] = {
    &ADC1BUF0, &ADC1BUF1, &ADC1BUF2, &ADC1BUF3
};
static uint8_t CHANNELS = 0;
static uint16_t volatile* volatile BUFFER_IDX[MAX_CHANNELS];

void __attribute__((interrupt, no_auto_psv)) _AD1Interrupt(void) {
    *(BUFFER_IDX[0]++) = *ADCVALS[0];
    if (CHANNELS >= 1) {
        *(BUFFER_IDX[1]++) = *ADCVALS[1];
        if (CHANNELS >= 2) {
            *(BUFFER_IDX[2]++) = *ADCVALS[2];
             if (CHANNELS >= 3) {
                *(BUFFER_IDX[3]++) = *ADCVALS[3];
            }
        }
    }
}

It copies between 1-4 register values into memory, depending on the value of CHANNELS, which is a value between 0-3 which is set elsewhere in the program via a setter function.

I found the nested if's extremely ugly and changed it to this:

int i;
for (i = 0; i <= CHANNELS; i++) {
    *(BUFFER_IDX[i]++) = *ADCVALS[i];
}

which promptly broke the ISR. This is an embedded system, PIC24 architecture, 64 MHz clock. The ISR is severely time constrained and must finish within 1 µs. The for loop is apparently too slow, while the nested if is fast enough.

My question, then, is two-fold:

  1. Is there a less ugly way to do what the nested if clauses do, without slowing down the ISR?
  2. Why is the for loop so much slower? I would have expected the compiler (xc16) to be smart enough to generate similar asm for both (at -O2).
Etoile answered 10/10, 2022 at 8:13 Comment(7)
For arrays, a loop condition like i <= CHANNELS is almost always wrong. Usually something like i < CHANNELS is correct.Darnley
Also, have you looked at the generated assembly code? What is CHANNELS? Is it a macro (common convention is to use all-uppercase words for macros, not for variables) or an enumeration?Darnley
CHANNELS is a value between 0-3, where 0 means copy one value, and 3 means copy four. BUFFER_IDX is an array of pointers to a preallocated memory region. The region is split into 1-4 sections, depending on the value of CHANNELS. So i <= CHANNELS is correct, since the loop should run for CHANNELS == 3 but not for CHANNELS == 4.Etoile
The final if condition should probably be if (CHANNELS == 3) instead of if (CHANNELS >= 3), since the only valid values for CHANNELS is 0-3. In that case, I believe the two code examples are equivalent, no?Etoile
As already mentioned, you should look and compare the generated assembly. It is hard to predict performance by looking at the C code alone. If you have issues in high performance code, you cannot just blindly trust compiler to do the right thing.Bacchae
This question doesn't make much sense and cannot be answered without knowing the type/declarations of BUFFER_IDX and CHANNELS and so on. Start with disassembling the original code and take it from there. The only thing I can tell you from this code is that you shouldn't be using naive types such as int when coding embedded systems. In case of PIC, using int may make the code needlessly slow for nothing gained and the compiler might not be able to optimize away such basic mistakes.Insider
@Etoile As for why int is slow, it's pretty obvious. It's an 8-bit MCU so swap it for uint8_t.Insider
H
5
for (int i = 0; i <= CHANNELS; i++) {
    *(BUFFER_IDX[i]++) = *ADCVALS[i];
}

Is

        mov     DWORD PTR [rbp-4], 0
        jmp     .L2
.L3:
        mov     eax, DWORD PTR [rbp-4]
        cdqe
        mov     rax, QWORD PTR [rbp-80+rax*8]
        mov     edx, DWORD PTR [rax]
        mov     eax, DWORD PTR [rbp-4]
        cdqe
        mov     rax, QWORD PTR [rbp-48+rax*8]
        lea     rsi, [rax+4]
        mov     ecx, DWORD PTR [rbp-4]
        movsx   rcx, ecx
        mov     QWORD PTR [rbp-48+rcx*8], rsi
        mov     DWORD PTR [rax], edx
        add     DWORD PTR [rbp-4], 1
.L2:
        mov     eax, DWORD PTR [rbp-4]
        cmp     eax, DWORD PTR [rbp-8]
        jle     .L3

And

*(BUFFER_IDX[0]++) = *ADCVALS[0];
if (CHANNELS >= 1) {
    *(BUFFER_IDX[1]++) = *ADCVALS[1];
    if (CHANNELS >= 2) {
        *(BUFFER_IDX[2]++) = *ADCVALS[2];
        if (CHANNELS >= 3) {
            *(BUFFER_IDX[3]++) = *ADCVALS[3];
        }
    }
}

Is

mov     rax, QWORD PTR [rbp-80]
mov     edx, DWORD PTR [rax]
mov     rax, QWORD PTR [rbp-48]
lea     rcx, [rax+4]
mov     QWORD PTR [rbp-48], rcx
mov     DWORD PTR [rax], edx
cmp     DWORD PTR [rbp-4], 0
jle     .L2
mov     rax, QWORD PTR [rbp-72]
mov     edx, DWORD PTR [rax]
mov     rax, QWORD PTR [rbp-40]
lea     rcx, [rax+4]
mov     QWORD PTR [rbp-40], rcx
mov     DWORD PTR [rax], edx
cmp     DWORD PTR [rbp-4], 1
jle     .L2
mov     rax, QWORD PTR [rbp-64]
mov     edx, DWORD PTR [rax]
mov     rax, QWORD PTR [rbp-32]
lea     rcx, [rax+4]
mov     QWORD PTR [rbp-32], rcx
mov     DWORD PTR [rax], edx
cmp     DWORD PTR [rbp-4], 2
jle     .L2
mov     rax, QWORD PTR [rbp-56]
mov     edx, DWORD PTR [rax]
mov     rax, QWORD PTR [rbp-24]
lea     rcx, [rax+4]
mov     QWORD PTR [rbp-24], rcx
mov     DWORD PTR [rax], edx

How you can see nested if will do less jumps but current compilers can optimize it and with -O3 flag you will get something like this

        mov     eax, DWORD PTR [rsp+12]
        test    eax, eax
        js      .L2
        mov     rax, QWORD PTR [rsp+48]
        mov     edx, DWORD PTR [rax]
        mov     rax, QWORD PTR [rsp+16]
        mov     DWORD PTR [rax], edx
        mov     eax, DWORD PTR [rsp+12]
        test    eax, eax
        jle     .L2
        mov     rax, QWORD PTR [rsp+56]
        mov     edx, DWORD PTR [rax]
        mov     rax, QWORD PTR [rsp+24]
        mov     DWORD PTR [rax], edx
        mov     eax, DWORD PTR [rsp+12]
        sub     eax, 1
        jle     .L2
        mov     rax, QWORD PTR [rsp+64]
        mov     edx, DWORD PTR [rax]
        mov     rax, QWORD PTR [rsp+32]
        mov     DWORD PTR [rax], edx
        mov     eax, DWORD PTR [rsp+12]
        cmp     eax, 2
        jle     .L2
        mov     rax, QWORD PTR [rsp+72]
        mov     edx, DWORD PTR [rax]
        mov     rax, QWORD PTR [rsp+40]
        mov     DWORD PTR [rax], edx
        mov     eax, DWORD PTR [rsp+12]
.L2:

That has +- same performance as nested if-s

Holler answered 10/10, 2022 at 8:42 Comment(3)
Thank you for the explanation. I suspected -O3 might make the for loop faster, but I'm using the free version of xc16 which only goes to -O2. I suppose I'm stuck the the nested if's for now.Etoile
@Etoile on -O2 it is loop and it really little bit slower, but code more readableHoller
Performance of some x86 and a PIC are vastly different things. It's completely senseless to optimize in favour of branch prediction for example.Insider
D
2

I'm gonna go out on a limb ('cuz I don't usually get this close to the metal).

switch( CHANNELS ) {
    case 3: *(BUFFER_IDX[3]++) = *ADCVALS[3]; /* Fallthroughs all the way down */
    case 2: *(BUFFER_IDX[2]++) = *ADCVALS[2];
    case 1: *(BUFFER_IDX[1]++) = *ADCVALS[1];
    default:*(BUFFER_IDX[0]++) = *ADCVALS[0];
}

This is 'prettier', and trades-off branching with the fetch/assign operations.
Would be nice if someone familiar with Assembly and performance code would please (kindly and gently) evaluate this alternative.


EDIT
New information about the relationship of ISR duration and # of 'channels' read suggests this might be an alternative that is easier on the eyes.

void __attribute__((interrupt, no_auto_psv)) _AD1Interrupt(void) {
    *(BUFFER_IDX[0]++) = *ADCVALS[0]; if (CHANNELS == 0) goto doneHere;
    *(BUFFER_IDX[1]++) = *ADCVALS[1]; if (CHANNELS == 1) goto doneHere;
    *(BUFFER_IDX[2]++) = *ADCVALS[2]; if (CHANNELS == 2) goto doneHere;
    *(BUFFER_IDX[3]++) = *ADCVALS[3];
doneHere:
    ; // a "sequence point"
}

One has to overcome one's aversion to goto. The compiler is indifferent. Coders should be too.

Dethrone answered 10/10, 2022 at 10:15 Comment(7)
Be careful on GCC warnings too.Kuwait
@tilz0R Yes, this is "rough"... I only use an old compiler, not as 'finicky' as current versions.... Just wondering about the C -> assembly of this. Probably worse than a modern, clever optimiser would achieve... Noted, though. Thank you...Dethrone
There's no obvious reason why this would be faster or slower than a for loop. Older embedded systems compilers are known to optimize switch statements poorly, however. The PIC compiles in particular always had a real nasty reputation in regards to pretty much everything, including optimizations. Only one way to tell and that's to disassemble with the compiler used.Insider
I actually tried this already. It is faster, but not as fast as the nested if's (and not fast enough, unfortunately).Etoile
@Etoile I think this is not place with bad performanceHoller
This is not the same as the OP's code when CHANNELS is greater than 3.Cardinalate
@JohnBollinger Moot point now that th OP has commented here that this is "not fast enough", but... 3rd comment under the original question, OP writes "CHANNELS is a value between 0-3,.." Just going on the clues at the time... All good... :-)Dethrone
E
1

I did as several commenters suggested and looked at the assembly generated by the compiler for the different codes. I also realized that the 1 µs time constraint is only valid for the case where CHANNELS == 0. For values greater than zero, the program bottlenecks elsewhere, which relaxes the time constraint on the ISR somewhat.

Here's the asm for the nested if's:

mov _BUFFER_IDX,w0
mov _ADC1BUF0,w1
mov w1,[w0++]
mov w0,_BUFFER_IDX
mov.b   _CHANNELS,WREG
bra z,.L79
mov _BUFFER_IDX+2,w1
mov _ADC1BUF1,w2
mov w2,[w1++]
mov w1,_BUFFER_IDX+2
sub.b   w0,#1,[w15]
bra leu,.L79
mov _BUFFER_IDX+4,w1
mov _ADC1BUF2,w4
mov w4,[w1++]
mov w1,_BUFFER_IDX+4
sub.b   w0,#2,[w15]
bra z,.L79
mov _BUFFER_IDX+6,w0
mov _ADC1BUF3,w5
mov w5,[w0++]
mov w0,_BUFFER_IDX+6
.L79
retfie

And here it is for the for loop:

mov #_CHANNELS,w6
mov.b   [w6],w6
ze  w6,w6
mov #_ADCVALS-2,w3
clr w0
mov #_BUFFER_IDX,w5
.L79:
add w0,w0,w2
add w5,w2,w2
mov [w2],w1
mov [++w3],w4
mov [w4],[w1++]
mov w1,[w2]
inc w0,w0
sub w0,w6,[w15]
bra leu,.L79
retfie

These are all single-cycle operations, except for retfie (return from interrupt), which takes 6 cycles. For the case CHANNELS == 0, the nested if's use a total of 12 cycles, while the for loop uses 21.

For the switch-case suggested in another answer:

switch( CHANNELS ) {
    case 3: *(BUFFER_IDX[3]++) = *ADCVALS[3]; /* Fallthroughs all the way down */
    case 2: *(BUFFER_IDX[2]++) = *ADCVALS[2];
    case 1: *(BUFFER_IDX[1]++) = *ADCVALS[1];
    default:*(BUFFER_IDX[0]++) = *ADCVALS[0];
}

the asm looks like this:

mov.b   _CHANNELS,WREG
sub.b   w0,#2,[w15]
bra z,.L81
sub.b   w0,#3,[w15]
bra z,.L82
sub.b   w0,#1,[w15]
bra z,.L80
.L79:
mov _BUFFER_IDX,w0
mov _ADC1BUF0,w5
mov w5,[w0++]
mov w0,_BUFFER_IDX
.L82:
mov _BUFFER_IDX+6,w0
mov _ADC1BUF3,w1
mov w1,[w0++]
mov w0,_BUFFER_IDX+6
.L81:
mov _BUFFER_IDX+4,w0
mov _ADC1BUF2,w2
mov w2,[w0++]
mov w0,_BUFFER_IDX+4
.L80:
mov _BUFFER_IDX+2,w0
mov _ADC1BUF1,w4
mov w4,[w0++]
mov w0,_BUFFER_IDX+2
bra .L79
retfie

which uses 18 cycles for CHANNELS == 0.

For curiosity, I also tested a do-while loop:

uint16_t i = 0;
do {
    *(BUFFER_IDX[i]++) = *ADCVALS[i];
} while (i++ < CHANNELS);

which turns out to be the slowest of the bunch at 22 cycles:

mov #_ADCVALS-2,w3
mov #_CHANNELS,w6
mov.b   [w6],w6
ze  w6,w6
inc w6,w6
clr w0
mov #_BUFFER_IDX,w5
.L79:
add w0,w0,w2
add w5,w2,w2
mov [w2],w1
mov [++w3],w4
mov [w4],[w1++]
mov w1,[w2]
inc w0,w0
sub w0,w6,[w15]
bra nz,.L79
retfie

I am a bit surprised that a difference of only 9 cycles is enough to break the time constraint, but I suppose the ISR must have been only just fast enough, so those extra cycles pushed it over the edge.

Etoile answered 10/10, 2022 at 17:36 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.