Masm assembly 8086 carry flag between data word addition
Asked Answered
P

3

1

So I have this problem I'm supposed to solve and I've spent hours trying to figure out the best way to do this, google hasn't been of much help.

The problem is to create a subroutine that is given a list of words that you then add with another list that becomes the output. Its basically a method of working with large numbers.

My code works fine for carry flags within words, but for carry flag from one full word to another it does not work. The first 16 bit word (0005 in the example below) is a flag used to tell my subroutine how many words there are.

For instance, given the following input,

//si     0005 0000 EEEE DDDD CCCC BBBB
//di     0005 0000 1111 2222 3333 4445

when the output expected is:

0005 0001 0000 0000 0000 0000

My code produces:

0005 0000 FFFF FFFF FFFF 0000 

I believe I understand why this is happening for the most part, but am unsure of the best way to solve this issue. I need a low cost method of carrying over a 1 between different chunks of data.

;---------------------------------------
; ADD Subroutine
;---------------------------------------
    .data

    bxx dw 0000h                        ;
    cxx dw 0000h                        ;

    .code
;---------------------------------------
addx:                                   ;
    mov bxx, bx                         ;save incoming register
    mov cxx, cx                         ;save incoming register
    mov bx, si                          ;move n to bl - acts as a cursor
loopAdd:                                ;loop point
    mov cx, [si+bx]                     ;move word at point si+bx into register cx
    ADC [di+bx], cx                     ;add with carry  
    sub bx, 0002h;                      ;decrement cursor by a full word
    cmp bx, 0000h                       ;bx == 0?
    jg loopAdd                          ;no? jump to loop point
end:                                    ;
    mov bx, bxx                         ;return register to original state
    mov cx, cxx                         ;return register to original state
    ret                                 ;return
;---------------------------------------
Pitchblack answered 3/3, 2016 at 7:8 Comment(1)
Remainder: cmp will modify flag CF while mov won't.Cupule
C
2

You have to save the carry flag from the previous iteration.

Try this:

;---------------------------------------
; ADD Subroutine
;---------------------------------------
    .data

    bxx dw 0000h                        ;
    cxx dw 0000h                        ;

    .code
;---------------------------------------
addx:                                   ;
    mov bxx, bx                         ;save incoming register
    mov cxx, cx                         ;save incoming register
    mov bx, si                          ;move n to bl - acts as a cursor
    clc                                 ;clear carry flag
    pushf                               ;save flag register
loopAdd:                                ;loop point
    mov cx, [si+bx]                     ;move word at point si+bx into register cx
    popf                                ;restore saved flag register
    ADC [di+bx], cx                     ;add with carry
    pushf                               ;save flag register
    sub bx, 0002h;                      ;decrement cursor by a full word
    jg loopAdd                          ;if the result is positive, jump to loop point
end:                                    ;
    popf                                ;remove saved flag register from the stack
    mov bx, bxx                         ;return register to original state
    mov cx, cxx                         ;return register to original state
    ret                                 ;return
;---------------------------------------

Note that cmp bx, 0000h isn't needed because cmp is sub except for cmp only modify flags and don't store calculated value, so you can check the result of sub directly.

Cupule answered 3/3, 2016 at 8:26 Comment(2)
pushf/popf is a really clunky solution. Using lea bx, [bx-2] / mov cx, bx / jcxz to loop without affecting flags would be faster, or use the loop instruction (even though it's slow). 16bit effective addresses don't allow a scaled index, IIRC, otherwise you could also loop with dec (which doesn't affect the carry flag), but that causes a partial-flag stall or slowdown on most CPUs. (Much more severe on pre-Sandybridge.) lahf/sahf is another option: those instructions are both fast on Intel (but a bit slow on AMD).Millikan
Thanks for your help!Pitchblack
C
2

OP says he wants a low cost solution to preserve the carry between iterations. @MikeCAT had a solution; @PeterCordes suggested some improvements.

There's one more really nice improvement you can get when doing multiprecision arithmetic, under the assumption that your multiprecison value is "big" (contains many word values), and that's unrolling the inner loop N times, avoiding loop counting/carry flag damage inside the unrolled section. (If your multiprecision arithmetic is not very multi, you don't need a lot of optimization).

I've revised @MikeCAT's answer here, with the assumption that the unrolling should be 8 iterations.

The code has 3 parts: determining how to handle a fragment of 8 words, handling the fragment in an unrolled way, and then handling multiple 8 word chunks efficiently in the main unrolled loop.

For OPs' example of 5 words, this routine never gets to the full unrolled loop. For large multiword values, it does, and I'd expect this routine is probably pretty fast.

[The following code is untested.]

;---------------------------------------
; ADD Subroutine
;   si = pointer to count word of 1st multiprecision value
;   di = pointer to count word of 2nd multiprecision value
;   assert: si->count == di ->count
;   preserves si, di; exits with carry from addition
;---------------------------------------
sizeofword equ 2
;---------------------------------------
add_multiple: ; destroys ax, si, di
    push cx                             ;save incoming register
    mov  cx, [si]
    lea  si, sizeofword[si+cx]          ; find least significant word  
    lea  di, sizeofword[di+cx]

; determine entry point into unrolled loop by taking counter modulo 8
    mov cx, si                          ;move n to bl - acts as a cursor
    shr cl, 1 
    jc  add_xx1
    je  done                            ; edge case: 0 words in value
add_xx0:
    shr cl, 1
    jc  add_x10
add_x00:
    shr cl, 1
    jnc add_000                         ; note carry flag is clear                         
;   clc                                
;   jmp add_100
    mov  ax, 0[si]                      
    add  0[di], ax                      ; do 1st add without carry
    lea  si, -1[si]
    lea  di, -1[di]
    jmp  add_011

add_x10:
    shr cl, 1
    jnc add_010
;   clc
;   jmp add_110
    mov  ax, 0[si]
    add  0[di], ax
    lea  si, -1[si]
    lea  di, -1[di]
    jmp  add_101

add_x01:
    shr cl, 1
    jnc add_001
;   clc
;   jmp add_101
    mov  ax, 0[si]
    adc  0[di], ax
    lea  si, -1[si]
    lea  di, -1[di]
    jmp  add_100

add_xx1:
    shr cl, 1
    jnc add_x01
add_x11:
    shr cl, 1
    jnc add_011
;   clc
;   jmp add_111

; the following code adds a fragment of an 8 word block
add_111: ; carry bit has value to propagate
    mov  ax, 0[si]         
;   adc  0[di], ax
    add  0[di], ax                             ; no carry in on 1st add
    lea  si, -1[si]
    lea  di, -1[di]
add_110:
    mov  ax, 0[si]
    adc  0[di], ax
    lea  si, -1[si]
    lea  di, -1[di]
add_101:
    mov  ax, 0[si]
    adc  0[di], ax
    lea  si, -1[si]
    lea  di, -1[di]
add_100:
    mov  ax, 0[si]
    adc  0[di], ax
    lea  si, -1[si]
    lea  di, -1[di]
add_011:
    mov  ax, 0[si]
    adc  0[di], ax
    lea  si, -1[si]
    lea  di, -1[di]
add_010:
    mov  ax, 0[si]
    adc  0[di], ax
    lea  si, -1[si]
    lea  di, -1[di]
add_001:
    mov  ax, 0[si]
    adc  0[di], ax
    lea  si, -1[si]
    lea  di, -1[di]
add_000:
    mov  ax, 0[si]
    adc  0[di], ax
    dec   cx                     ; does not disturb carry
    lea  si, -1[si]
    lea  di, -1[di]
    je    done

; unrolled loop here; very low overhead
add_8words: ; carry bit has value to propagate
    mov  ax, 0[si]
    adc  0[di], ax
    mov  ax, -1[si]
    adc  -1[di], ax
    mov  ax, -2[si]
    adc  -2[di], ax
    mov  ax, -3[si]
    adc  -3[di], ax
    mov  ax, -4[si]
    adc  -4[di], ax
    mov  ax, -5[si]
    adc  -5[di], ax
    mov  ax, -6[si]
    adc  -6[di], ax
    mov  ax, -7[si]
    adc  -7[di], ax
    dec   cx
    lea   si, -8[si]
    lea   di, -8[di]
    jne   add_8word
done: pop  cx
    ret

;---------------------------------------

The sequence

    mov  ax, 0[si]
    adc  0[di], ax
    lea  si, -1[si]
    lea  di, -1[di]

suggests perhaps using the single-word block move instructions as an alternative:

    std                          ; want to step backward
    ...
    lods
    adc  ax, 0[di]
    stos
    ...
    cld
    ret

with appropriate adjustments to the code, left to the reader.

Whether the loop I wrote or the LODS/STOS version is faster is something to measure carefully.

Chlorophyll answered 3/3, 2016 at 10:30 Comment(6)
Yeah, unrolling is a huge win with adc because looping is more expensive than normal. I was trying to keep it simple. See also https://mcmap.net/q/14460/-problems-with-adc-sbb-and-inc-dec-in-tight-loops-on-some-cpus, including the comments. I'm pretty sure the lods / stos version would be worse. lods and stos are both 3-uop or 3-mop instructions on Intel and AMD CPUs. (lodsd is a 2 uop instruction on Haswell and later, but lodsw is still 3). One store per clock should be the throughput bottleneck, but uop throughput would be the bottleneck with lods/stosMillikan
However, adc m, r/i is 4 uops with one per 2c throughput, even on Broadwell / Skylake where adc r, r/i is one uop with 1c latency. (Haswell and earlier have 2c latency adc, because a uop couldn't have more than two input dependencies. Haswell had FMA; Broadwell extended that 3-input uop functionality to adc. And adc r, m can't micro-fuse on BDW, so it's always 2 uops. However, on Haswell and earlier, adc r,m and adc r,r are both 2 uops. So load/adc r,m/store (with mov instructions) is probably optimal: 4 uops per adc, saturating the frontend where adc has 1c latency.Millikan
On Intel pre-Broadwell, you'll actually be limited by adc latency, to half that throughput. Also, nobody's mentioned the elephant in the room: 64bit code would be 4x faster, because each 64bit adc could do the work of four 16bit adc instructions.Millikan
I just posted this as an answer.Millikan
I stuck with 8086 because that's how OP posed it. I agree, a 64 bit version would be faster. (That's lots of great data about instruction costs).Chlorophyll
Thanks for taking the time to answer my question! I learned from your lea statements and ended up using them in my final solution.Pitchblack
M
2

If you want fast multi-precision addition, use 64bit code if at all possible. Doing 4x the width with every instruction directly gives a 4x speedup. On 386-compatible CPUs, you can use 32bit instructions and registers even in 16bit mode, which will give you a 2x speedup.


To take Ira's unroll suggestion a step further

  • reducing loop overhead by one lea
  • avoiding adc with a memory destination, which is slow on Intel.

    ... set up for the unrolled loop, with Ira's setup code
    ; di = dst pointer
    ; bx = src-dst, so bx+di = src
add_8words: ; carry bit has value to propagate
    ;sahf   ; another way to avoid a partial-flag stall while looping
    mov  ax, 0[bx+di]
    adc  ax, 0[di]
    mov  0[di], ax
    mov  ax, -1[bx+di]
    adc  ax, -1[di]
    mov  -1[di], ax
    mov  ax, -2[bx+di]
    adc  ax, -2[di]
    mov  -2[di], ax
    mov  ax, -3[bx+di]
    adc  ax, -3[di]
    mov  -3[di], ax
    mov  ax, -4[bx+di]
    adc  ax, -4[di]
    mov  -4[di], ax
    mov  ax, -5[bx+di]
    adc  ax, -5[di]
    mov  -5[di], ax
    mov  ax, -6[bx+di]
    adc  ax, -6[di]
    mov  -6[di], ax
    mov  ax, -7[bx+di]
    adc  ax, -7[di]
    mov  -7[di], ax

    lea   di, -8[di]
    ; lahf
    ; put the flag-setting insn next to the branch for potential macro-fusion
    dec   cx             ; causes a partial-flag stall, but only once per 8 adc
    jne   add_8word

This should run at nearly one adc per clock (minus loop overhead) on Intel Broadwell, and on AMD K8 through Bulldozer. (I forget if k8/k10 can do two loads per clock. That'll be the bottleneck if it can't). Using adc with a memory destination is not good on Intel, but fine on AMD, according to Agner Fog's tables. Intel Haswell and earlier will be limited by the 2c latency of adc. (Broadwell made adc and cmov single-uop instructions, making use of the 3-dependency uop support added in Haswell so FMA can be a single uop instruction).

A loop insn might be a win on older CPUs where a partial-reg stall is really bad, but other ways to avoid a partial-flag stall might be even better than the slow loop instruction.

Using the dest-source trick reduces the loop overhead of decrementing the pointer. 2-register addressing modes in loads don't need to micro-fuse, because pure mov loads are a single uop anyway. stores do need to micro-fuse, so you should prefer one-register addressing modes for stores. Haswell's extra store-address unit on port7 can only handle simple addressing modes, and 2-register addressing modes can't micro-fuse.

See also Problems with ADC/SBB and INC/DEC in tight loops on some CPUs for info about multiprecision adc loops and some experiments on Core2 vs. SnB CPUs for partial-flag stall performance.

Another way to loop here would be lea si, -1[si] / mov cx, si / jcxz. 16bit sucks, and you can't use [cx] in an effective address.

Millikan answered 3/3, 2016 at 11:27 Comment(4)
Oh, nice trick with the index registers. Old hand learns new trick.Chlorophyll
Thanks for taking the time to write all this out. I ended up using the loop and adc instructions - definitely learned something here. I know loop is supposed to be slow but this was running in dosbox.Pitchblack
@Harrison: DOSBOX is a pure emulator, right? It doesn't run your code natively at all? It does emulate a 386-compatible CPU (or even P6 with MMX/SSE?), so you could have used 32bit adc to do the same loop but with twice the data per iteration. All the flag-handling problems of adc loops are identical between 16bit and 32bit, though, except that you can use lea for no-flag adds on any register you like.Millikan
@PeterCordes I believe that is the case, yes. I'll keep that in mind in the future. Thanks again for all your guidance!Pitchblack

© 2022 - 2024 — McMap. All rights reserved.