How to calculate the sum of a sequence of powers of 2 in x86?
Asked Answered
V

1

0

I want to calculate a sequence of exponents of base 2. for example if given 3 the program would calculate (2^3 + 2^2 + 2^1 + 2^0). The problem is I cant calculate the exponent to being with.

I have tried shl to calculate the exponent and I have tried a loop. Neither produce the correct result, my latest attempt is below.

.586
.MODEL FLAT
INCLUDE io.h                            
.STACK 4096

.DATA

        Exponent    DWORD   ?           
        Prompt  BYTE    "Enter an exponent", 0

        Base        DWORD  2

        string      BYTE    40 DUP (?)

        resultLbl   BYTE    "The solution is", 0
        Solution    DWORD   20 DUP (?), 0

.CODE
_MainProc PROC

    input   Prompt, string, 40      ; read ASCII characters (N)
        atod    string              ; ascii to double
        mov     exponent, eax       ; store in memory     


        push Exponent
        push Base

        call function
        add esp, 8

        dtoa   Solution, eax         ; convert to ASCII characters
        output  resultLbl, Solution   ; output label and sum



                mov     eax, 0      ; exit with return code 0
        ret

_MainProc ENDP


;@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
; Procedure takes base and exponent from main

Function PROC
    push ebp                      ; stores base pointer
    mov  ebp, esp                 ; set base pointer to current position

    mov ebx, [ebp + 12]   ; read base (2)
    mov ecx,  [ebp + 8]       ; read N



    MainLoop:           ; loop to calculate expoent
          mul ebx           ; multiply base by its self
          add eax, ebx           ; save answer in eax
          dec ecx                ; decrement the exponent
          cmp ecx, 0             ; check if exponent is 0 yet
          je exit                ; Exit if exponent is 0
          jg mainloop            ; stay in loop if exponent is above 0

    Exit: 

        pop ebp                   ; restore base pointer 
        ret

function ENDP



END                  

I want the program to produce the correct exponent result, and if you can calculate the sequence above, for example if given 3 the program would calculate (2^3 + 2^2 + 2^1 + 2^0).

Validate answered 8/6, 2019 at 20:33 Comment(3)
Consult the instruction set reference about how mul works. Hint: it puts result in eax (and edx) so your add eax, ebx is pointless. Pick a different target register and add eax to it. Also you forgot to initialize eax so you are just multiplying some random number. PS: learn to use a debugger.Epner
2^3 + 2^2 + 2^1 + 2^0 = 2^4-1, 2^n + 2^(n-1) + ... + 2^1 + 2^0 = 2^(n+1) - 1Feeder
Your edit destroyed most of the question, now it's just asking about an integer power loop, not a sum of powers, and not specifically powers of 2 anymore. And you didn't even link or describe the Microsoft doc you found. Rolling it back. If you want to post the link in a comment, go ahead. (Link-only answers are not welcome on stack overflow, so if you want to post an answer it has to be at least a minimal answer on its own in case the link rots.)Clary
C
4

Base 2 is an extremely special case because computers use binary integers.

2^n = 1 << n i.e. 2n = 1 left-shifted n times. i.e. that bit-position is set.

You're asking for 2^0 + 2^1 + ... 2^n. A number with all bits set up to an including bit n.
That's 1 less than 2^(n+1), so the number you want is 2^(n+1) - 1.


x86 has an instruction to set a bit at a given position in a register: bts reg, reg = Bit Test and Set. On Intel CPUs, the reg-reg form decodes to a single uop, with 1-cycle latency. (https://agner.org/optimize/) It also puts the old value of the bit in CF, but normally we don't care about that when using it for this. Normally we just zero a destination register and use BTS to set a single bit.

On Ryzen, bts reg,reg is 2 uops and variable-count shifts are single-uop, so mov edx, 1 / shl edx, cl is good, too. But that's worse for code-size, and more uops on Intel CPUs.

On Intel Sandybridge-family CPUs, shl reg, cl decodes to 3 uops. BMI2 shlx reg, reg, reg is 1 uop, but BMI2 support is still far from being something you can assume, unfortunately.

(Memory-destination BTS with a register source is very slow, treating the source as an indexing into an arbitrary length bit-string, not just the addressed dword, so normally avoid it for performance.)


Calculating 2^(n+1) - 1 without overflow

(C version of this answer on another question, with compiler output.)

If we just added 1 to the input number before feeding it to BTS, it could wrap around and fail to work for 31 (which should set all bits, but would instead set bit 32%32 = 0)

Since we need one extra instruction after reading the input anyway, we can use x86's shift-and-add instruction (LEA) to do one more shift as we subtract 1. So with n=31 we start with the high bit set, and shift it out to get zero. Subtracting 1 then sets all the bits, as desired

xor    edx,edx                 ; edx = 0
bts    edx, eax                ; edx |=  1<<eax
lea    edx, [edx + edx - 1]    ; edx = (edx<<1) - 1

The logic for this goes as follows

n      BTS result (2^n)    *2 - 1
0    ->       1         ->   1        = 2 - 1
1    ->       2         ->   3        = 4 - 1
2    ->       4         ->   7        = 8 - 1
3    ->       8         ->  15        = 16 - 1
...
30   ->  0x40000000     ->  0x7FFFFFFF  = 0x80000000 - 1
31   ->  0x80000000     ->  0xFFFFFFFF  = 0 - 1

It's not a coincidence that the last column is a running total of the 2nd column.

LEA with 3 "components" to the addressing mode has extra latency vs. simpler LEAs, e.g. 3-cycle latency on Sandybridge-family vs. 1 cycle. But it's still a single uop so it's a great choice for throughput.

If we really wanted to optimize, and didn't have to worry about the n=31 case overflowing, we'd write the ASCII -> int loop by hand but start with a total of 1 instead of 0, to fold the n+1 into finding n. Then bts would give us our 2^(n+1), and we could simply dec that.


There's no need to store exponent to memory and reload it again; you already have it in a register where you want it.

Your comment on the atod line is wrong, and your code wouldn't make sense if it was right. ASCII to double (like the C function strtod) would return in x87 st0 or SSE2 xmm0, not EAX. atod actually stands for ASCII (decimal) to integer. Perhaps the d stands for DWORD.

    input   Prompt, string, 40      ; read ASCII characters (N)
    atod    string                 ; ascii (decimal) to integer in EAX

    xor    edx,edx                 ; edx = 0
    bts    edx, eax                ; edx |=  1<<eax
    lea    edx, [edx + edx - 1]    ; edx = (edx<<1) - 1

    dtoa    Solution, edx

It takes 2 instructions to implement 1<<n so it's silly to put it into its own function. Just passing args + calling the function would have take as much work as simply using bts.

0 -> 1 -> 1
1 -> 2 -> 3
2 -> 4 -> 7

Compilers typically use mov edx,1 + mov ecx, eax + shl edx, cl. But that's a missed optimization, especially on Intel CPUs.

BMI2 shlx edx, edx, ecx would help (avoiding the mov), but is worse code-size than xor-zero + bts. And BMI2 isn't always available.

If you can't get your compiler to use BTS (or you have BMI2 for SHLX), another good implementation is (2ULL << n) - 1 so you bake the extra shift into the number you start shifting. So a count of 31 can shift the bit out and produce 0.

Clary answered 8/6, 2019 at 21:0 Comment(8)
@MichaelPetch: thanks, fixed to clean up all that rambling guesswork and just explain that it's not double.Clary
You should add common x86 math instructions to ppcg x86 tips. Also isn't the shortcut computation 2^(n+1) - 1Feeder
@qwr: IMO Tips for golfing in x86/x64 machine code is cluttered enough with things that are only relevant to size optimization. I don't think we should write answers for every micro-optimization that compilers already do for performance reasons. (Although compilers do often miss BTS.)Clary
@qwr: oh right, the question title misled me, they are asking about the sum of all powers of 2, not just one power of 2.Clary
is this correct: xor + bts is 1+3 bytes, xor + inc + shl is 1+1+2 bytes?Feeder
@qwr: no, xor reg,reg is 2 bytes. BTS is 3. SHL reg,cl is 2 bytes. (Creating an arbitrary small constant costs 3 bytes, e.g. with push imm8/pop, LEA from a known constant, or xor-zero + inc/dec). Anyway, fixed my answer to do 2^(n+1)-1 safely with only 1 more instruction! This answer is not attempting to focus on code-golf, only code-size as a tie-breaker when performance is equal.Clary
there are many ways to calculate 2^(n+1) - 1 without overflow in Creating a mask with N least significant bits setCholi
@phuclv: those are all more complicated and thus worse if we don't need 0 as a possible output. But it's an interesting related problem. I posted a (non-)answer there using this trick because it's highly related. I should really submit a bug report to GCC and clang about using BTS; it's missed optimization that seems to come up fairly often, especially for x |= 1<<n; where it's even more of a missed opt.Clary

© 2022 - 2024 — McMap. All rights reserved.