Problems with ADC/SBB and INC/DEC in tight loops on some CPUs

I am writing a simple BigInteger type in Delphi. It mainly consists of a dynamic array of TLimb, where a TLimb is a 32 bit unsigned integer, and a 32 bit size field, which also holds the sign bit for the BigInteger.
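In outline, the declaration looks something like this (simplified; the field names here are illustrative, not necessarily the real ones):

type
  TLimb = UInt32;                  // a limb: 32 bit unsigned
  PLimb = ^TLimb;

  BigInteger = record
  private
    FData: array of TLimb;         // magnitude, least significant limb first
    FSize: Integer;                // number of used limbs; the top bit holds the sign
  end;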

To add two BigIntegers, I create a new BigInteger of the appropriate size and then, after some bookkeeping, call the following procedure, passing it three pointers to the respective starts of the arrays for the left and right operand and the result, as well as the numbers of limbs for left and right, respectively.

Plain code:

class procedure BigInteger.PlainAdd(Left, Right, Result: PLimb; LSize, RSize: Integer); 
asm
// EAX = Left, EDX = Right, ECX = Result
        PUSH    ESI
        PUSH    EDI
        PUSH    EBX
        MOV     ESI,EAX                 // Left
        MOV     EDI,EDX                 // Right
        MOV     EBX,ECX                 // Result
        MOV     ECX,RSize               // Number of limbs at Right
        MOV     EDX,LSize               // Number of limbs at Left
        CMP     EDX,ECX
        JAE     @SkipSwap
        XCHG    ECX,EDX                 // Left and LSize should be largest
        XCHG    ESI,EDI                 // so swap
@SkipSwap:
        SUB     EDX,ECX                 // EDX contains rest
        PUSH    EDX                     // ECX contains smaller size
        XOR     EDX,EDX                 // Index := 0; also clears the Carry Flag for the first ADC
@MainLoop:
        MOV     EAX,[ESI + CLimbSize*EDX]  // CLimbSize = SizeOf(TLimb) = 4.
        ADC     EAX,[EDI + CLimbSize*EDX]
        MOV     [EBX + CLimbSize*EDX],EAX
        INC     EDX
        DEC     ECX
        JNE     @MainLoop
        POP     EDI                        // Remaining limbs of the longer operand
        INC     EDI                        // INC and DEC leave the Carry Flag untouched,
        DEC     EDI                        // so this tests EDI for zero without losing the carry
        JE      @LastLimb
@RestLoop:
        MOV     EAX,[ESI + CLimbSize*EDX]
        ADC     EAX,ECX                    // ECX is zero here, so this just adds the carry
        MOV     [EBX + CLimbSize*EDX],EAX
        INC     EDX
        DEC     EDI
        JNE     @RestLoop
@LastLimb:
        ADC     ECX,ECX                    // Add in final carry
        MOV     [EBX + CLimbSize*EDX],ECX
@Exit:
        POP     EBX
        POP     EDI
        POP     ESI
end;
// RET is inserted by Delphi compiler.

This code worked well, and I was pretty satisfied with it, until I noticed that, on my development setup (Win7 in a Parallels VM on an iMac), a simple PURE PASCAL addition routine, doing the same while emulating the carry with a variable and a few if clauses, was faster than my plain, straightforward handcrafted assembler routine.
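For illustration, the inner loop of such a carry-emulating Pascal routine looks roughly like this (a simplified sketch, not the actual routine; it uses the unit's TLimb/PLimb types and relies on overflow and range checking being off):

{$POINTERMATH ON}                    // so a PLimb can be indexed like an array
{$Q-,R-}                             // the wrap-around arithmetic needs overflow/range checks off

procedure EmulatedCarryAdd(Left, Right, Result: PLimb; Count: Integer);
var
  I: Integer;
  Sum: TLimb;
  Carry: Boolean;
begin
  Carry := False;
  for I := 0 to Count - 1 do
  begin
    Sum := Left[I] + Right[I];       // may wrap around 2^32
    if Carry then
    begin
      Inc(Sum);                      // add the carry from the previous limb
      Carry := Sum <= Left[I];       // wrapped if the result did not end up above Left[I]
    end
    else
      Carry := Sum < Left[I];        // wrapped if the result ended up below Left[I]
    Result[I] := Sum;
  end;
  // ... the remaining limbs of the longer operand and the final carry limb
  // are handled the same way.
end;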

It took me a while to find out that on certain CPUs (including my iMac and an older laptop), the combination of DEC or INC and ADC or SBB could be extremely slow. But on most of my others (I have five other PCs to test it on, although four of these are exactly the same), it was quite fast.

So I wrote a new version, emulating INC and DEC using LEA and JECXZ instead, like so:

Part of emulating code:

@MainLoop:
        MOV     EAX,[ESI + EDX*CLimbSize]
        LEA     ECX,[ECX - 1]                   // Avoid INC and DEC, see above.
        ADC     EAX,[EDI + EDX*CLimbSize]
        MOV     [EBX + EDX*CLimbSize],EAX
        LEA     EDX,[EDX + 1]
        JECXZ   @DoRestLoop                     // LEA does not modify Zero flag, so JECXZ is used.
        JMP     @MainLoop
@DoRestLoop:
// similar code for the rest loop 

That made my code on the "slow" machines almost three times as fast, but some 20% slower on the "faster" machines. So now, as initialization code, I do a simple timing loop and use that to decide if I'll set up the unit to call the plain or the emulated routine(s). This is almost always correct, but sometimes it chooses the (slower) plain routines when it should have chosen the emulating routines.

But I don't know if this is the best way to do this.
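For what it's worth, the selection itself is simple: a unit-level procedure variable is pointed at one of the two implementations during initialization. Roughly like this (a sketch only: EmulatedAdd, the scratch sizes and the use of TStopwatch from System.Diagnostics are illustrative, and it assumes the routines are declared static so they can be assigned to a plain procedure type):

type
  TAddProc = procedure(Left, Right, Result: PLimb; LSize, RSize: Integer);

var
  AddProc: TAddProc;                       // what BigInteger.Add ends up calling

function TimeAdd(Proc: TAddProc): Int64;
const
  CSize = 64;                              // limbs per scratch operand
  CReps = 10000;
var
  A, B, R: array[0..CSize] of TLimb;       // one extra limb for the final carry
  SW: TStopwatch;
  I: Integer;
begin
  FillChar(A, SizeOf(A), $FF);             // worst case: every limb produces a carry
  FillChar(B, SizeOf(B), $FF);
  SW := TStopwatch.StartNew;
  for I := 1 to CReps do
    Proc(@A[0], @B[0], @R[0], CSize, CSize);
  Result := SW.ElapsedTicks;
end;

procedure SetupAddRoutine;
begin
  if TimeAdd(BigInteger.PlainAdd) <= TimeAdd(BigInteger.EmulatedAdd) then
    AddProc := BigInteger.PlainAdd         // the ADC with INC/DEC version
  else
    AddProc := BigInteger.EmulatedAdd;     // the ADC with LEA/JECXZ version
end;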

Question

I gave my solution, but do the asm gurus here perhaps know a better way to avoid the slowness on certain CPUs?

Update

Peter's and Nils' answers helped me a lot to get on the right track. This is the main part of my final solution for the DEC version:

Plain code:

class procedure BigInteger.PlainAdd(Left, Right, Result: PLimb; LSize, RSize: Integer);
asm
        PUSH    ESI
        PUSH    EDI
        PUSH    EBX
        MOV     ESI,EAX                         // Left
        MOV     EDI,EDX                         // Right
        MOV     EBX,ECX                         // Result
        MOV     ECX,RSize
        MOV     EDX,LSize
        CMP     EDX,ECX
        JAE     @SkipSwap
        XCHG    ECX,EDX
        XCHG    ESI,EDI
@SkipSwap:
        SUB     EDX,ECX
        PUSH    EDX
        XOR     EDX,EDX
        XOR     EAX,EAX
        MOV     EDX,ECX
        AND     EDX,$00000003                   // EDX = leftover limbs (0..3) for the tail
        SHR     ECX,2                           // ECX = number of 4-limb blocks; sets ZF
        CLC                                     // Clear carry for the first ADC; CLC leaves ZF alone
        JE      @MainTail                       // ZF from SHR: no full blocks, go to the tail
@MainLoop:
        // Unrolled 4 times. Unrolling further did not improve speed any more.
        MOV     EAX,[ESI]
        ADC     EAX,[EDI]
        MOV     [EBX],EAX
        MOV     EAX,[ESI + CLimbSize]
        ADC     EAX,[EDI + CLimbSize]
        MOV     [EBX + CLimbSize],EAX
        MOV     EAX,[ESI + 2*CLimbSize]
        ADC     EAX,[EDI + 2*CLimbSize]
        MOV     [EBX + 2*CLimbSize],EAX
        MOV     EAX,[ESI + 3*CLimbSize]
        ADC     EAX,[EDI + 3*CLimbSize]
        MOV     [EBX + 3*CLimbSize],EAX
        // Update pointers.
        LEA     ESI,[ESI + 4*CLimbSize]
        LEA     EDI,[EDI + 4*CLimbSize]
        LEA     EBX,[EBX + 4*CLimbSize]
        // Update counter and loop if required.
        DEC     ECX                             
        JNE     @MainLoop
@MainTail:
        // Add index*CLimbSize so @MainX branches can fall through.
        LEA     ESI,[ESI + EDX*CLimbSize]
        LEA     EDI,[EDI + EDX*CLimbSize]
        LEA     EBX,[EBX + EDX*CLimbSize]
        // Indexed jump.
        LEA     ECX,[@JumpsMain]
        JMP     [ECX + EDX*TYPE Pointer]
        // Align jump table manually, with NOPs. Update if necessary.
        NOP
// Jump table.
@JumpsMain:
        DD      @DoRestLoop
        DD      @Main1
        DD      @Main2
        DD      @Main3
@Main3:
        MOV     EAX,[ESI - 3*CLimbSize]
        ADC     EAX,[EDI - 3*CLimbSize]
        MOV     [EBX - 3*CLimbSize],EAX
@Main2:
        MOV     EAX,[ESI - 2*CLimbSize]
        ADC     EAX,[EDI - 2*CLimbSize]
        MOV     [EBX - 2*CLimbSize],EAX
@Main1:
        MOV     EAX,[ESI - CLimbSize]
        ADC     EAX,[EDI - CLimbSize]
        MOV     [EBX - CLimbSize],EAX
@DoRestLoop:

// etc...    

I removed a lot of white space, and I guess the reader can work out the rest of the routine; it is similar to the main loop. The result is a speed improvement of approx. 20% for larger BigIntegers, and some 10% for small ones (only a few limbs).

The 64 bit version now uses 64 bit addition where possible (in the main loop and in Main3 and Main2, which are not "fall-through" like above). Before, 64 bit was quite a lot slower than 32 bit, but now it is 30% faster than 32 bit and twice as fast as the original simple 64 bit loop.

Update 2

Following the advice Intel gives in its Intel 64 and IA-32 Architectures Optimization Reference Manual (3.5.2.6, Partial Flag Register Stalls, Example 3-29), the main loop now looks like this:

        XOR     EAX,EAX
        .ALIGN  16
@MainLoop:
        ADD     AL,255          // Regenerate CF from the saved carry (AL is 0 or 1); writes all flags, so no partial flag register stall
        MOV     EAX,[ESI]
        ADC     EAX,[EDI]       // Left + Right + carry
        MOV     [EBX],EAX
        MOV     EAX,[ESI + CLimbSize]
        ADC     EAX,[EDI + CLimbSize]
        MOV     [EBX + CLimbSize],EAX
        MOV     EAX,[ESI + 2*CLimbSize]
        ADC     EAX,[EDI + 2*CLimbSize]
        MOV     [EBX + 2*CLimbSize],EAX
        MOV     EAX,[ESI + 3*CLimbSize]
        ADC     EAX,[EDI + 3*CLimbSize]
        MOV     [EBX + 3*CLimbSize],EAX
        SETC    AL              // Save carry for next iteration
        MOVZX   EAX,AL
        ADD     ESI,CUnrollIncrement*CLimbSize  // LEA has slightly worse latency
        ADD     EDI,CUnrollIncrement*CLimbSize
        ADD     EBX,CUnrollIncrement*CLimbSize
        DEC     ECX
        JNZ     @MainLoop

The carry is saved in AL with SETC and zero-extended into EAX with MOVZX. At the top of the next iteration, ADD AL,255 turns it back into the carry flag: if AL is 1, the addition produces 256 and sets CF, otherwise CF stays clear. Because that ADD writes all of the flags, the ADC that follows does not suffer from a partial flag register stall. Also see comments.

Because the carry is saved in EAX, I can also use ADD to update the pointers without worrying about clobbering the carry flag.

Softfinned answered 18/8, 2015 at 23:25 Comment(5)
This will be relevant. I actually also believe JECXZ is slow on some (possibly not the same) architectures. I would refer to guys like Agner Fog for better information than I can give, though.Reimburse
ADD would completely upset the carry flag, so I would have to emulate that. I tried that, but the emulation cost more time than the improvement from using ADD could give me. I even tried SSE, with a speed improvement over my older code, but the emulating code I posted gave the best result, until now. Before, I tried to avoid ADC by using ADD and emulating the carry flag, I tried to avoid ADC by using SSE and emulating the carry flag, and I tried to get rid of INC and DEC with the code above. But I have a feeling I missed something obvious.Softfinned
If you can use GPLed code in your project, use GMP's existing asm routines. If you can link to LGPLed libraries, then do that instead. gmplib.org. GMP has very carefully hand-tuned routines for multi-precision integers. Also, obviously use 64bit code if you can. If BigInt performance is an issue for your code, it's going to be worth shipping a 64bit version which has double the BigInt performance.Jactitation
@500-InternalServerError: jecxz is only 2 uops on Intel, vs. 1 for a macro-fused test&branch. It's only one total macro-op on AMD. It's not nearly as slow as the LOOP instruction. This looks like a case where it's justified, since you need to loop without affecting flags. Nils' unrolled version amortizes the cost nicely.Jactitation
@PeterCordes: I think I could use GMP, but I want to do everything myself. I also implemented a .NET-compatible Decimal type just for the fun of it.Softfinned

What you're seeing on old P6-family CPUs is a partial-flag stall.
Early Sandybridge-family handles merging more efficiently, and later SnB-family (e.g. Skylake) has no merging cost at all: uops that need both CF and some flags from the SPAZO group read them as 2 separate inputs.

Intel CPUs (other than P4) rename each flag bit separately, so JNE only depends on the last instruction that set all the flags it uses (in this case, just the Z flag). In fact, recent Intel CPUs can even internally combine an inc/jne into a single inc-and-branch uop (macro-fusion). However, the trouble comes when reading a flag bit that was left unmodified by the last instruction that updated any flags.

Agner Fog says Intel CPUs (even PPro / PII) don't stall on inc / jnz. It's not actually the inc/jnz that's stalling, it's the adc in the next iteration that has to read the CF flag after inc wrote other flags but left CF unmodified.

; Example 5.21. Partial flags stall when reading unmodified flag bits
cmp eax, ebx
inc ecx
jc xx
; Partial flags stall  (P6 / PIII / PM / Core2 / Nehalem)

Agner Fog also says more generally: "Avoid code that relies on the fact that INC or DEC leaves the carry flag unchanged." (for Pentium M/Core2/Nehalem). The suggestion to avoid inc/dec entirely is obsolete, and only applied to P4. Other CPUs rename different parts of EFLAGS separately, and only have trouble when merging is required (reading a flag that was unmodified by the last insn to write any flags).

On the machines where it's fast (Sandybridge and later), they're inserting an extra uop to merge the flags register when you read bits that weren't written by the last instruction that modified it. This is much faster than stalling for 7 cycles, but still not ideal.

P4 doesn't do partial-register renaming at all, not even for EFLAGS: it always tracks whole registers. So inc/jz has a "false" dependency on whatever wrote the flags before it. This means that the loop condition can't detect the end of the loop until execution of the adc dep chain gets there, so the branch mispredict that can happen when the loop-branch stops being taken can't be detected early. It does prevent any partial-flags stalls, though.

Your lea / jecxz avoids the problem nicely. It's slower on SnB and later because you didn't unroll your loop at all. Your LEA version is 11 uops (can issue one iteration per 3 cycles), while the inc version is 7 uops (can issue one iter per 2 cycles), not counting the flag-merging uop it inserts instead of stalling.

If the loop instruction wasn't slow, it would be perfect for this. It's actually fast on AMD Bulldozer-family (1 m-op, same cost as a fused compare-and-branch), and Via Nano3000. It's bad on all Intel CPUs, though (7 uops on SnB-family).


Unrolling

When you unroll, you can get another small gain from using pointers instead of indexed addressing modes, because 2-reg addressing modes can't micro-fuse on SnB and later. A group of load/adc/store instructions is 6 uops without micro-fusion, but only 4 with micro-fusion. CPUs can issue 4 fused-domain uops/clock. (See Agner Fog's CPU microarch doc, and instruction tables, for details on this level.)

Save uops where you can, so the CPU can issue instructions faster than it executes them and can see far enough ahead in the instruction stream to absorb any bubbles in instruction fetch (e.g. after a branch mispredict). Fitting in the 28-uop loop buffer also means power savings (and, on Nehalem, avoids instruction-decoding bottlenecks). There are also things like instruction alignment and crossing uop cache-line boundaries that make it hard to sustain a full 4 uops / clock without the loop buffer.

Another trick is to keep pointers to the end of your buffers, and count up towards zero. (So at the start of your loop, you get the first item as end[-idx].)

        ; pure loads are always one uop, so we can still index it
        ; with no perf hit on SnB
        lea     esi, [esi + ecx*CLimbSize]   ; point to end of src1 (ecx is the limb count)
        neg     ecx
        clc                ; neg sets CF for a non-zero count; clear it before the first adc

UNROLL equ 4
@MainLoop:
        MOV     EAX, [ESI + 0*CLimbSize + ECX*CLimbSize]
        ADC     EAX, [EDI + 0*CLimbSize]
        MOV     [EBX + 0*CLimbSize], EAX

        MOV     EAX, [ESI + 1*CLimbSize + ECX*CLimbSize]
        ADC     EAX, [EDI + 1*CLimbSize]
        MOV     [EBX + 1*CLimbSize], EAX

        ; ... repeated UNROLL times.  Use an assembler macro to repeat these 3 instructions with increasing offsets

        LEA     ECX, [ECX+UNROLL] ; loop counter

        LEA     EDI, [EDI+CLimbSize*UNROLL]  ; Unrolling makes it worth doing
        LEA     EBX, [EBX+CLimbSize*UNROLL]  ; a separate increment to save a uop for every ADC and store on SnB & later.

        JECXZ   @DoRestLoop                  ; LEA does not modify the Zero flag, so JECXZ is used.
        JMP     @MainLoop
@DoRestLoop:

An unroll of 4 should be good. No need to overdo it, since you're probably going to be able to saturate the load/store ports of pre-Haswell CPUs with an unroll of just 3 or 4, maybe even 2.

An unroll of 2 will make the above loop exactly 14 fused-domain uops for Intel CPUs. adc is 2 ALU (+1 fused memory), jecxz is 2, the rest (including LEA) are all 1. In the unfused domain, 10 ALU/branch and 6 memory (well, 8 memory if you really count store-address and store-data separately).

  • 14 fused-domain uops per iteration: issue one iteration per 4 clocks. (The odd 2 uops at the end have to issue as a group of 2, even from the loop buffer.)
  • 10 ALU & branch uops: Takes 3.33c to execute them all on pre-Haswell. I don't think any one port will be a bottleneck, either: adc's uops can run on any port, and lea can run on p0/p1. The jumps use port5 (and jecxz also uses one of p0/p1)
  • 6 memory operations: Takes 3c to execute on pre-Haswell CPUs, which can handle 2 per clock. Haswell added a dedicated AGU for stores so it can sustain 2load+1store/clock.

So for pre-Haswell CPUs, using LEA/JECXZ, an unroll of 2 won't quite saturate either the ALU or the load/store ports. An unroll of 4 will bring it up to 22 fused uops (6 cycles to issue). 14 ALU&branch: 4.66c to execute. 12 memory: 6 cycles to execute. So an unroll of 4 will saturate pre-Haswell CPUs, but only just barely. The CPU won't have any buffer of instructions to churn through on a branch mispredict.

Haswell and later will always be bottlenecked on the frontend (4 uops per clock limit), because the load/adc/store combo takes 4 uops, and can be sustained at one per clock. So there's never any "room" for loop overhead without cutting into adc throughput. This is where you have to know not to overdo it and unroll too much.

On Broadwell / Skylake, adc is only a single uop with 1c latency, and load / adc r, m / store appears to be the best sequence. adc m, r/i is 4 uops. This should sustain one adc per clock, like AMD.

On AMD CPUs, adc is only one macro-op, so if the CPU can sustain an issue rate of 4 (i.e. no decoding bottlenecks), then they can also use their 2 load / 1 store port to beat Haswell. Also, jecxz on AMD is as efficient as any other branch: only one macro-op. Multi-precision math is one of the few things AMD CPUs are good at. Lower latencies on some integer instructions give them an advantage in some GMP routines.


An unroll of more than 5 might hurt performance on Nehalem, because that would make the loop bigger than the 28-uop loop buffer. Instruction decoding would then limit you to fewer than 4 uops per clock. On even earlier CPUs (Core2), there's a 64B x86-instruction loop buffer (64B of x86 code, not uops), which helps some with decode.

Unless this adc routine is the only bottleneck in your app, I'd keep the unroll factor down to maybe 2. Or maybe even not unroll, if that saves a lot of prologue / epilogue code, and your BigInts aren't too big. You don't want to bloat the code too much and create cache misses when callers call lots of different BigInteger functions, like add, sub, mul, and do other things in between. Unrolling too much to win at microbenchmarks can shoot you in the foot if your program doesn't spend a long time in your inner loop on each call.

If your BigInt values aren't usually gigantic, then it's not just the loop you have to tune. A smaller unroll could be good to simplify the prologue/epilogue logic. Make sure you check the lengths so that ECX, counting up by UNROLL from a negative value, can't step over zero without ever being exactly zero, because then JECXZ would never fire (see the sketch below). This is the trouble with unrolling and vectors. :/
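For example, with the negative-index scheme above and UNROLL = 4, one way to guarantee that is to round the limb count down before entering the loop and keep the 0..3 leftover limbs for the tail code (just a sketch, same register roles as the snippet above):

        ; ECX = total limb count on entry
        MOV     EDX, ECX
        AND     EDX, 3                       ; EDX = 0..3 leftover limbs for the tail code
        AND     ECX, -4                      ; ECX = multiple of UNROLL for the main loop
        LEA     ESI, [ESI + ECX*CLimbSize]   ; point ESI at the end of the unrolled part
        NEG     ECX                          ; count up towards zero, UNROLL limbs at a time
        CLC                                  ; NEG set CF for a non-zero count; clear it before the first ADC
        JECXZ   @DoRestLoop                  ; no full blocks at all: go straight to the leftovers
        ; ... fall into @MainLoop from the snippet above ...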


Saving / restoring CF for old CPUs, instead of flag-less looping:

This might be the most efficient way:

lahf
# clobber flags
sahf              ; cheap on AMD and Intel.  This doesn't restore OF, but we only care about CF

# or

setc al
# clobber flags
add  al, 255      ; regenerate CF: 1 + 255 = 256 sets the carry if al was 1, stays clear if al was 0

Using the same register as the adc dep chain isn't actually a problem: eax will always be ready at the same time as the CF output from the last adc. (On AMD and P4/Silvermont partial-reg writes have a false dep on the full reg. They don't rename partial regs separately). The save/restore is part of the adc dep chain, not the loop condition dep chain.

The loop condition only checks flags written by cmp, sub, or dec. Saving/restoring flags around it doesn't make it part of the adc dep chain, so the branch mispredict at the end of the loop can be detected before adc execution gets there. (A previous version of this answer got this wrong.)


There's almost certainly some room to shave off instructions in the setup code, maybe by working in the registers the arguments already arrive in. You don't have to use edi and esi for pointers, although I know it makes initial development easier to use registers in ways consistent with their "traditional" use (e.g. destination pointer in EDI).

Does Delphi let you use ebp? It's nice to have a 7th register.

Obviously 64bit code would make your BigInt code run about twice as fast, even though you'd have to worry about doing a single 32b adc at the end of a loop of 64bit adc. It would also give you 2x the amount of registers.

Jactitation answered 19/8, 2015 at 5:21 Comment(15)
Whoa! I'll need some time to digest (understand) all of it, but it looks good. I got the general ideas, and I already read about this partial flag stall. Unfortunately, the built-in assembler (BASM) of Delphi does not have macros, so I'll have to do some copy&paste or I'll write a little generator to generate the unrolled loops. Thanks!Softfinned
What about unrolling, and saving the carry at the end of the loop using RCL EAX,1, doing all the pointer/index stuff at the end of the loop and at the start of the loop, doing RCR EAX,1 again, to restore the carry? Or is that too slow? I know I could simply try and measure, but unfortunately not right here and now. I only need EAX again inside the (unrolled) loop.Softfinned
Delphi lets me use everything I like. EBP is needed for local variables, but I don't need any in this function anyway.Softfinned
FWIW, in my previous trials, I did use the single register addressing modes and incremented pointers too, but could not measure any difference with indexed access. Perhaps in the unrolled loop?Softfinned
@RudyVelthuis: Interesting idea to use rcl/rcr to save flags, but those instructions are slow (3 uops each, according to Agner Fog's insn table, which you should download if you're tuning asm :). If you want to save/restore the carry flag, do it with setc al / add al, 255. sahf/lahf misses the overflow flag, so could cause a partial-flag stall for adc. But I think your best bet is to do the looping without flags, using JECXZ (unless getting the loop condition right is proving inconvenient with unrolling).Jactitation
@RudyVelthuis: If you use ebp, you probably need to push / pop it, unfortunately. Delphi will probably put a mov esp, ebp in your function epilogue, if it always builds a stack frame with ebp in the prologue.Jactitation
@RudyVelthuis: single-register addressing modes: yes, it's unrolling that makes the diff. On SnB and later, adc eax, [r1 + r2*4] is 2 uops, because the load can't micro-fuse. adc eax, [r1] is one uop, because the load does micro-fuse. (This isn't mentioned in Agner Fog's manuals; it was discovered and confirmed more recently than his last update.) Without unrolling, you're just turning a 2-uop instruction into two 1-uop instructions. With unrolling, you're saving uops in the loop. A group of load/adc/store is 5 uops without micro-fusion. CPUs can issue 4 fused-domain uops/clock.Jactitation
Your throughput is limited on pre-haswell by memory ports, though. SnB/IvB can do 2 memory ops / clock, up to one of them being a store. Haswell can do 3. I think AMD can also do 2 loads + one store. Nehalem and earlier can do only one load and one store per clock.Jactitation
@RudyVelthuis: updated the answer to include some of what I said in comments, and ended up expanding on the perf analysis and tradeoffs of unrolling. And then fixing it because I got it wrong the first time, xD.Jactitation
Delphi does not always build a stack frame. In this case, since there are more than 3 arguments, LSize and RSize are passed on the stack, but as soon as you have those in a register, you can push EBP (and must). I did some tests, and I'll have to do a little more, I guess. The unrolled loop alone, without replacement of the two-register access, did not make things faster on a "fast" CPU (the only one I have, at the moment -- I'm on vacation), actually it is a bit slower. I imagine it might be a lot faster than the emulated loop on "slow" CPUs, though. I'll report as soon as I can test this.Softfinned
FWIW, I did an unroll of 8. I'll try to cut that down to 4 and see what happens. I must admit that I'll have to re-read your answer a few times before I get all of it. It is quite heavy stuff. <g>Softfinned
@RudyVelthuis: no faster with unrolling: This is not surprising if you are still using 2-register addressing modes for adc and store. And yes, you have to save ebp, it'd just be nice if Delphi didn't waste time messing with it itself. (like gcc -fomit-frame-pointer, where gcc just uses whatever offset is needed from the current value of esp.) BTW, I wouldn't say "emulated". I'd say "flagless", or "jecxz" loop. I think the "jecxz" version is a good bet even on SnB and later, with a bit of unrolling. Depending on your use-case, make sure you tune it well for small limb-counts too.Jactitation
I called it emulated, internally, because I started all this by emulating the carry, like the (then faster) PUREPASCAL routine did. The name stuck, but I realize it is not a good name.Softfinned
FWIW, it was faster with unrolling, although not that much, but only if I used DEC ECX; JNE @... again (ECX shifted right appropriately before the loop), instead of LEA and JECXZ. I could not test on a "slow" CPU yet.Softfinned
I finally had some time to implement the single-register addressing mode. That made a difference of about 12% (overall speed improvement) for the very long BigIntegers. I don't think there is much that could be squeezed out now. I tried with much smaller BigIntegers (1 to 10 limbs at most, average ~3 limbs), and there was no significant timing difference with the plain simple loop, so I will use the optimized unrolled loop routine, i.e. this last version, as replacement. It has become a lot less easy to read, so I had to add a few more comments. <g>Softfinned

There are so many x86 chips with vastly different timings in use that you can't realistically have optimal code for all of them. Your approach of having two known-good functions and benchmarking before use is already pretty advanced.

However, depending on the size of your BigIntegers, you can likely improve your code by simple loop unrolling. That will drastically reduce the loop overhead.

E.g. you could execute a specialized block that does the addition of eight integers like this:

@AddEight:
        MOV     EAX,[ESI + EDX*CLimbSize + 0*CLimbSize]
        ADC     EAX,[EDI + EDX*CLimbSize + 0*CLimbSize]
        MOV     [EBX + EDX*CLimbSize + 0*CLimbSize],EAX
        MOV     EAX,[ESI + EDX*CLimbSize + 1*CLimbSize]
        ADC     EAX,[EDI + EDX*CLimbSize + 1*CLimbSize]
        MOV     [EBX + EDX*CLimbSize + 1*CLimbSize],EAX
        MOV     EAX,[ESI + EDX*CLimbSize + 2*CLimbSize]
        ADC     EAX,[EDI + EDX*CLimbSize + 2*CLimbSize]
        MOV     [EBX + EDX*CLimbSize + 2*CLimbSize],EAX
        MOV     EAX,[ESI + EDX*CLimbSize + 3*CLimbSize]
        ADC     EAX,[EDI + EDX*CLimbSize + 3*CLimbSize]
        MOV     [EBX + EDX*CLimbSize + 3*CLimbSize],EAX
        MOV     EAX,[ESI + EDX*CLimbSize + 4*CLimbSize]
        ADC     EAX,[EDI + EDX*CLimbSize + 4*CLimbSize]
        MOV     [EBX + EDX*CLimbSize + 4*CLimbSize],EAX
        MOV     EAX,[ESI + EDX*CLimbSize + 5*CLimbSize]
        ADC     EAX,[EDI + EDX*CLimbSize + 5*CLimbSize]
        MOV     [EBX + EDX*CLimbSize + 5*CLimbSize],EAX
        MOV     EAX,[ESI + EDX*CLimbSize + 6*CLimbSize]
        ADC     EAX,[EDI + EDX*CLimbSize + 6*CLimbSize]
        MOV     [EBX + EDX*CLimbSize + 6*CLimbSize],EAX
        MOV     EAX,[ESI + EDX*CLimbSize + 7*CLimbSize]
        ADC     EAX,[EDI + EDX*CLimbSize + 7*CLimbSize]
        MOV     [EBX + EDX*CLimbSize + 7*CLimbSize],EAX
        LEA     ECX,[ECX - 8]

Now you rebuild your loop: execute the above block as long as you have more than 8 elements to process, and do the remaining few elements using the single-element addition loop that you already have (keeping the loop control flag-free, e.g. with the LEA/JECXZ approach from the question, so the carry survives from one block to the next).

For large BigIntegers you'll spend most of the time in the unrolled part, which should execute a lot faster now.

If you want it even faster, then write seven additional blocks that are specialized to the remaining element counts and branch to them based on the element count. This can best be done by storing the seven addresses in a lookup table, loading up the address from it and directly jumping into the specialized code.

For small element counts this completely removes the entire loop, and for large element counts you'll get the full benefit of the unrolled loop.

Chud answered 18/8, 2015 at 23:48 Comment(3)
Thanks, that gave me a kick in the right direction. Makes code a little more complex, but should give a considerable improvement. I'll test it. Problem is, that I am not at home now (and this laptop does not display the problem), so I can only test it coming Monday.Softfinned
I'll make a special loop-unrolled version to be called for larger BigIntegers. I'll test how large "larger" is.Softfinned
@RudyVelthuis Btw, for optimization problems that need a lot of almost identical specialized routines I usually write myself a short program to generate the assembler code. Changing the arithmetic or size becomes very convenient that way.Chud
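Such a generator can be just a few lines, e.g. something like this illustrative Delphi console program that prints the eight-limb block above:

program GenAddBlock;
{$APPTYPE CONSOLE}
uses
  System.SysUtils;
var
  I: Integer;
begin
  for I := 0 to 7 do
  begin
    Writeln(Format('        MOV     EAX,[ESI + EDX*CLimbSize + %d*CLimbSize]', [I]));
    Writeln(Format('        ADC     EAX,[EDI + EDX*CLimbSize + %d*CLimbSize]', [I]));
    Writeln(Format('        MOV     [EBX + EDX*CLimbSize + %d*CLimbSize],EAX', [I]));
  end;
  Writeln('        LEA     ECX,[ECX - 8]');
end.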
